This work concerns controlled Markov chains with finite state and
action spaces. The transition law satisfies the simultaneous Doeblin 
condition, and the performance of a control policy is measured by 
the (long-run) risk-sensitive average cost criterion associated to a 
positive, but otherwise arbitrary, risk sensitivity coefficient. Within 
this context, the optimal risk-sensitive average cost is characterized 
via a minimization problem in a finite-dimensional Euclidean space. 


1. Introduction. This work concerns discrete-time Markov decision processes
(MDPs), where the controller selects actions from a finite set, and the
corresponding controlled process takes values in a finite set $S$. The decision
maker is supposed to be risk-averse with constant risk sensitivity coefficient
$\lambda > 0$, and the performance of a control policy is measured by
the (long-run) risk-sensitive average cost criterion. Under the simultaneous
Doeblin condition in Assumption 2.1, the main result of the paper, stated
as Theorem 3.5, provides a characterization of the optimal value function
$J^*(\lambda,\cdot)$ for arbitrary $\lambda > 0$. Roughly, this theorem shows that the optimal
value function is the infimum of a family $\mathcal{G}$ of functions on the state space, a
conclusion that, as described in the following section, is similar to results
already available for classical risk-neutral criteria. However, at the same time
this characterization reflects an interesting and important contrast with the
risk-neutral average cost index, which is illustrated in Example 2.2: when
$\lambda$ is large enough, the costs incurred while the system stays at transient
states, which can be visited only at “early stages” of the decision process,
have a definite impact on the risk-sensitive average performance criterion.
This feature implies that, even when the Markov chain associated with each
stationary policy has a single recurrent class, the risk-sensitive optimal
average cost is not necessarily constant, and that in this case the optimality
equation may have no solution at all. Such a potentially complex behavior
of $J^*(\lambda,\cdot)$ when $\lambda > 0$ is unrestricted is actually covered by the
characterization in Theorem 3.5, and highlights the main difference between the results
in this paper and those already available, which concern the case in which
$\lambda$ is small enough to guarantee that $J^*(\lambda,\cdot)$ is constant and its value is
determined via the optimality equation; see, for instance, [3] or [14] for the
discrete case, or [8] for MDPs over Borel spaces.

The study of MDPs endowed with the risk-sensitive average criterion can
be traced back, at least, to the seminal work of Howard and Matheson [17],
where models with finite state and action spaces were studied assuming the
following condition (C): Under each stationary policy the whole state space
is an aperiodic communicating class. In this context, the Perron-Frobenius
theory of positive matrices [7] was used to show that, for every $\lambda > 0$, the
$\lambda$-sensitive average cost associated to each stationary policy is a constant
function, and its value $\gamma$ can be characterized via the corresponding Poisson
equation; see also [11]. The Perron-Frobenius theory also provides a link
between risk-sensitive control and the Donsker-Varadhan theory of large
deviations [9]. It is well known that, under suitable recurrence conditions,
the occupation measure of a Markov process satisfies the large deviation
principle, with rate function given by the convex conjugate of a long-run
expected rate of exponential growth function. It is also worth mentioning that
some optimal investment models can be formulated as risk-sensitive control
problems, for asset dynamics models affected by economic factors, where
the goal is to maximize the growth rate of the expected utility of wealth
[1, 2, 12]. Problems of this kind are also linked with the deterministic model
of optimal economic development proposed by Gale and Neumann [10, 13].

The organization of the paper is as follows. In Section 2 a formal
description of the model is presented, the potentially complex dependence of
$J^*(\lambda,\cdot)$ on $\lambda > 0$ is explicitly shown and, after describing the main theorem,
an outline of the strategy that will be used to prove the characterization
result is given. In Section 3 a fundamental min-max equation satisfied by
the optimal value function is established, and such an equality is used as one
of the conditions in the definition of the family $\mathcal{G}$ in terms of which $J^*(\lambda,\cdot)$
is characterized in Theorem 3.5. After identifying the difficulties in proving
this result, the necessary technical preliminaries are established in Sections
4-6 and, finally, the main theorem is proved in Section 7.

Notation. Throughout the remainder $\mathbb{R}$ and $\mathbb{N}$ stand for the set of real
numbers and nonnegative integers, respectively. Given a finite set $S$, the
space of all real-valued functions defined on $S$ is denoted by $\mathcal{B}(S)$, and for
each $C \in \mathcal{B}(S)$

\[ \|C\| := \max_{w \in S} |C(w)| \]

is the corresponding maximum norm. The indicator function associated to
an event $W$ is denoted by $I[W]$ and, even without explicit reference, all
relations involving conditional expectations are supposed to hold almost
surely with respect to the underlying probability measure.

2. Decision model and outline of the work. Let an MDP be specified
by $\mathcal{M} = (S, A, \{A(x)\}, C, P)$, where the state space $S$ and the action set
$A$ are finite sets endowed with the discrete topology and, for each $x \in S$,
$A(x) \subset A$ is the nonempty subset of admissible actions at state $x$; the set $\mathbb{K}$ of
admissible pairs is defined by $\mathbb{K} := \{(x,a) \mid a \in A(x),\ x \in S\}$ and is considered
as a topological subspace of $S \times A$. On the other hand, $C\colon \mathbb{K} \to \mathbb{R}$ is the
one-step cost function, and $P = [p_{xy}(\cdot)]$ is the controlled transition law. The
interpretation of $\mathcal{M}$ is as follows: At each time $t \in \mathbb{N}$ the state of a dynamical
system is observed, say $X_t = x \in S$, and an action $A_t = a \in A(x)$ is chosen.
Then a cost $C(x,a)$ is incurred and, regardless of the previous states and
actions, the state of the system at time $t+1$ will be $X_{t+1} = y \in S$ with
probability $p_{xy}(a)$; this is the Markov property of the decision model.

Policies. For each $t \in \mathbb{N}$ the space $\mathbb{H}_t$ of admissible histories up to time $t$
is recursively defined by $\mathbb{H}_0 := S$, and $\mathbb{H}_t := \mathbb{K} \times \mathbb{H}_{t-1}$ for $t \ge 1$. A generic
element of $\mathbb{H}_t$ is denoted by $h_t = (x_0, a_0, x_1, a_1, \ldots, x_{t-1}, a_{t-1}, x_t)$, where $x_n \in S$
for $n \le t$, and $a_i \in A(x_i)$ for $i < t$. A policy $\pi = \{\pi_t\}$ is a special sequence
of stochastic kernels: For each $t \in \mathbb{N}$ and $h_t \in \mathbb{H}_t$, $\pi_t(\cdot \mid h_t)$ is a
probability measure on $A$ concentrated on $A(x_t)$. The class of all policies is
denoted by $\mathcal{P}$. Given the policy $\pi \in \mathcal{P}$ used to drive the system and the initial
state $X_0 = x \in S$, the distribution of the state-action process $\{(X_t, A_t)\}$ is
uniquely determined via Ionescu Tulcea's theorem (see, e.g., [15] or [18]);
such a distribution will be represented by $P_x^\pi$, whereas $E_x^\pi$ stands for the
corresponding expectation operator. Throughout the remainder $I_t$ denotes
the information vector up to time $t$, which is given by

\[ I_0 = X_0 \quad\text{and}\quad I_t := (X_0, A_0, \ldots, X_{t-1}, A_{t-1}, X_t), \qquad t = 1, 2, 3, \ldots \]
Next, define $\mathbb{F} := \prod_{x \in S} A(x)$, so that $\mathbb{F}$ consists of all (choice) functions
$f\colon S \to A$ satisfying $f(x) \in A(x)$ for each $x \in S$. A policy $\pi$ is
stationary if there exists $f \in \mathbb{F}$ such that, when the system evolves under $\pi$, at
each time $t \in \mathbb{N}$ the action applied is determined by $A_t = f(X_t)$; the class of
stationary policies is naturally identified with $\mathbb{F}$ and, with this convention,
$\mathbb{F} \subset \mathcal{P}$.

Performance index. As already noted, the controller is assumed to be
risk-averse with constant risk sensitivity $\lambda > 0$, that is, when facing a random
cost $Y$, she grades it through $E[e^{\lambda Y}]$. The certain equivalent of the random
variable $Y$ is the (possibly extended) real number defined by

\[ \mathcal{E}(\lambda, Y) := \frac{1}{\lambda} \log\bigl(E[e^{\lambda Y}]\bigr), \]

so that $e^{\lambda \mathcal{E}(\lambda, Y)} = E[e^{\lambda Y}]$, and then the controller is indifferent between
incurring the random cost $Y$ or paying the certain equivalent $\mathcal{E}(\lambda, Y)$ for
sure.
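As a small numerical illustration of the certain equivalent, the following Python sketch evaluates $\frac{1}{\lambda}\log E[e^{\lambda Y}]$ for a finitely supported random cost. The function name `certain_equivalent` and the two-point cost used below are hypothetical choices made only for this illustration; they do not come from the paper.

```python
import math

def certain_equivalent(values, probs, lam):
    """Certain equivalent (1/lam) * log E[exp(lam * Y)] of a finite random cost Y."""
    m = max(lam * v for v in values)  # shift so that exp() cannot overflow
    total = sum(p * math.exp(lam * v - m) for v, p in zip(values, probs))
    return (m + math.log(total)) / lam

# Hypothetical cost: Y is 0 or 2 with equal probability, so E[Y] = 1.
values, probs = [0.0, 2.0], [0.5, 0.5]
mean = sum(p * v for v, p in zip(values, probs))
for lam in (0.01, 1.0, 5.0):
    print(lam, certain_equivalent(values, probs, lam))
```

For a risk-averse controller ($\lambda > 0$) the computed value lies above the mean of $Y$ and below its largest value, and it approaches the mean as $\lambda$ decreases to 0.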

When the system evolves under $\pi \in \mathcal{P}$ and $x \in S$ is the initial state,
$J_n(\lambda,\pi,x)$ denotes the certain equivalent of the total cost incurred before
time $n > 0$, that is,

\[ J_n(\lambda,\pi,x) := \frac{1}{\lambda} \log\left( E_x^\pi\left[ \exp\left\{ \lambda \sum_{t=0}^{n-1} C(X_t,A_t) \right\} \right] \right), \tag{2.1} \]

whereas the (long-run expected $\lambda$-sensitive) average cost under $\pi$ starting
at $x$ is defined by

\[ J(\lambda,\pi,x) := \limsup_{n \to \infty} \frac{1}{n} J_n(\lambda,\pi,x). \tag{2.2} \]

The optimal ($\lambda$-sensitive) average cost at state $x$ is given by

\[ J^*(\lambda,x) := \inf_{\pi \in \mathcal{P}} J(\lambda,\pi,x), \tag{2.3} \]

and a policy $\pi^* \in \mathcal{P}$ is optimal if $J(\lambda,\pi^*,x) = J^*(\lambda,x)$ for every $x \in S$. Given
$\varepsilon > 0$, a policy $\pi$ is $\varepsilon$-optimal at state $x \in S$ if $J(\lambda,\pi,x) \le J^*(\lambda,x) + \varepsilon$; if
the policy $\pi$ is $\varepsilon$-optimal at every state, then $\pi$ is $\varepsilon$-optimal. The following
simultaneous Doeblin condition will be assumed throughout the sequel.


Assumption 2.1. There exist a state $z \in S$ and $M \in (0,\infty)$ such that

\[ E_x^f[T] \le M, \qquad x \in S,\ f \in \mathbb{F}, \]

where

\[ T := \min\{n > 0 \mid X_n = z\} \tag{2.4} \]

is the first positive arrival time to state $z$ and, by convention, the minimum
of the empty set is $\infty$.

The problem. As already mentioned, the main objective of the paper
is to provide a characterization of the optimal value function $J^*(\lambda,\cdot)$ for
arbitrary $\lambda > 0$. This problem has recently received considerable attention
in the literature and, under the above simultaneous Doeblin condition, the
results already established can be described as follows: if $\lambda > 0$ is sufficiently
small, then the optimal value function $J^*(\lambda,\cdot)$ is constant and, moreover, its
value $\gamma$ is the unique real number for which there exists $h\colon S \to \mathbb{R}$ satisfying
the optimality equation

\[ e^{\lambda[\gamma + h(x)]} = \min_{a \in A(x)} \left[ e^{\lambda C(x,a)} \sum_{y} p_{xy}(a)\, e^{\lambda h(y)} \right], \qquad x \in S; \tag{2.5} \]

see [3, 5, 14]. Also, modulo an additive constant, the relative value function
$h(\cdot)$ in this equation satisfies, for each $x \in S$,


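\[ \begin{aligned}
h(x) &= \inf_{\pi \in \mathcal{P}} \frac{1}{\lambda} \log E_x^\pi\left[ \exp\left\{ \lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma] \right\} \right] \\
&= \inf_{\pi \in \mathcal{P}} \frac{1}{\lambda} \log E_x^\pi\left[ \exp\left\{ \lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)] \right\} \right],
\end{aligned} \tag{2.6} \]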

where $T$ is the hitting time in (2.4). However, the situation is substantially
different when $\lambda > 0$ is arbitrary in that (i) Assumption 2.1 does not generally
imply that $J^*(\lambda,\cdot)$ is constant, (ii) the rightmost term in (2.6) may be $\infty$
and, moreover, (iii) even when the optimal value function takes on a single
value $\gamma$, it is not necessarily determined by (2.5). This potentially complex
behavior, which does not occur under Assumption 2.1 when the performance
index is the risk-neutral average cost, is illustrated in the following example
along the lines of Example 2.1 in [6]. In all, this example shows that, when
the risk sensitivity coefficient is large enough, the behavior of the system at
transient states, which may be occupied only at “early stages,” has an
important and definite influence on its performance, establishing a remarkable
difference with the risk-neutral case.


Example 2.2. Let $S = \{0,1,2\}$ and $A = \{0,1\}$. The sets of admissible
actions are given by $A(0) = A(2) = \{0\}$ and $A(1) = \{0,1\} = A$, whereas the
cost function always satisfies $C(x,a) = x$. Finally, for some $\rho \in (0,1)$, the
transition law is determined by

\[ p_{00}(0) = 1, \qquad p_{22}(0) = \rho^2 = 1 - p_{20}(0) \]

and

\[ p_{12}(1) = 1, \qquad p_{11}(0) = \rho = 1 - p_{10}(0). \]

In this context it is not difficult to see that Assumption 2.1 is satisfied
with $z = 0$. Now, let $f$ be the stationary policy determined by $f(1) = 0$.
Since $a = 0$ is the unique action available at the absorbing state 0 and
$C(0,0) = 0$, it follows that $J^*(\lambda,0) = J(\lambda,f,0) = 0$. Assume now that

\[ e^{\lambda}\rho > 1. \tag{2.7} \]


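Using that 0 is the unique available action at state 2 and that when the
system leaves state 2 it reaches $z = 0$, where a null cost is incurred forever,
it follows that $J^*(\lambda,2) = J(\lambda,f,2)$, whereas for each positive integer $n$,

\[ \begin{aligned}
E_2^f\left[ \exp\left\{ \lambda \sum_{t=0}^{n-1} C(X_t,A_t) \right\} \right]
&= \sum_{k=1}^{n} E_2^f\left[ \exp\left\{ \lambda \sum_{t=0}^{k-1} C(X_t,A_t) \right\} I[T = k] \right]
 + E_2^f\left[ \exp\left\{ \lambda \sum_{t=0}^{n-1} C(X_t,A_t) \right\} I[T > n] \right] \\
&= \sum_{k=1}^{n} e^{2\lambda k} \rho^{2(k-1)} (1 - \rho^2) + e^{2\lambda n} \rho^{2n} \\
&= (e^{\lambda}\rho)^{2n} + e^{2\lambda}(1 - \rho^2)\, \frac{(e^{\lambda}\rho)^{2n} - 1}{(e^{\lambda}\rho)^{2} - 1},
\end{aligned} \]

and then (2.1) and (2.2) together lead to $J(\lambda,f,2) = \frac{1}{\lambda}\log(e^{2\lambda}\rho^2) = \frac{2}{\lambda}\log(e^{\lambda}\rho) > 0$.
A similar argument shows that $J(\lambda,f,1) = \frac{1}{\lambda}\log(e^{\lambda}\rho) > 0$ and, since
applying action 1 at state 1 produces a transition to state 2, where the optimal
average cost is $\frac{2}{\lambda}\log(e^{\lambda}\rho) > J(\lambda,f,1)$, it follows that $f$ is also optimal at state
1. In short, under (2.7),

\[ J^*(\lambda,0) = 0 < \frac{1}{\lambda}\log(e^{\lambda}\rho) = 1 + \frac{\log(\rho)}{\lambda} = J^*(\lambda,1) < 2J^*(\lambda,1) = J^*(\lambda,2). \]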
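The closed-form values obtained in Example 2.2 can be checked numerically. The following Python sketch, which is only an illustration and not part of the original argument, iterates the recursion $e^{\lambda J_{n+1}(\lambda,f,x)} = e^{\lambda C(x,f(x))} \sum_y p_{xy}(f(x))\, e^{\lambda J_n(\lambda,f,y)}$ in the log domain for the stationary policy $f$ with $f(1) = 0$. The helper name `average_cost_estimates`, the values $\rho = 0.8$ and $\lambda = 1$, and the horizon $n = 2000$ are assumptions chosen for the illustration.

```python
import math

def average_cost_estimates(P, cost, lam, n):
    """Return J_n(lam, f, x) / n for every state x under a fixed stationary policy.

    P[x][y] is the transition probability under the policy and cost[x] the
    one-step cost. The recursion runs on v_k(x) = lam * J_k(lam, f, x), a
    log-expectation, so that exp() never overflows.
    """
    states = range(len(P))
    v = [0.0] * len(P)  # v_0 = 0 because the empty sum of costs is 0
    for _ in range(n):
        nxt = []
        for x in states:
            terms = [math.log(P[x][y]) + v[y] for y in states if P[x][y] > 0.0]
            m = max(terms)
            nxt.append(lam * cost[x] + m + math.log(sum(math.exp(t - m) for t in terms)))
        v = nxt
    return [v[x] / (lam * n) for x in states]

rho, lam = 0.8, 1.0                      # illustrative values with exp(lam) * rho > 1
P_f = [[1.0, 0.0, 0.0],                  # state 0 is absorbing
       [1.0 - rho, rho, 0.0],            # state 1 under the action f(1) = 0
       [1.0 - rho**2, 0.0, rho**2]]      # state 2
cost = [0.0, 1.0, 2.0]                   # C(x, a) = x
estimates = average_cost_estimates(P_f, cost, lam, 2000)
closed_form = [0.0, 1.0 + math.log(rho) / lam, 2.0 * (1.0 + math.log(rho) / lam)]
print(estimates, closed_form)
```

With these values the finite-horizon ratios $J_n/n$ agree with the three distinct limits $0$, $1 + \log(\rho)/\lambda$ and $2(1 + \log(\rho)/\lambda)$ up to a term of order $1/n$; rerunning the sketch with a small coefficient such as $\lambda = 0.1$, for which $e^{\lambda}\rho < 1$, gives ratios close to 0 at every state.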

Notice that the system will be ultimately absorbed by state $z = 0$ but,
when (2.7) holds, the costs incurred at the transient states have a definite
influence on the performance of the system. Assume now that the initial
state is $X_0 = 2$. From the specification of the model, it follows that
$X_t = 2$ for $t < T$, with $T$ as in (2.4) with $z = 0$, and in this case
$C(X_t,A_t) - J^*(\lambda,X_t) = 2 - 2(1 + \log(\rho)/\lambda) = -2\log(\rho)/\lambda$, so that
$\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)] = -2T\log(\rho)$. Therefore, the relative value function at state $x = 2$,
given by the rightmost term in (2.6), is

\[ h(2) = \frac{1}{\lambda} \log E_2^f\bigl[ \rho^{-2T} \bigr] = \frac{1}{\lambda} \log\left( \sum_{k=1}^{\infty} \rho^{-2k} \rho^{2(k-1)} (1 - \rho^2) \right) = \infty; \]

similarly, it can be established that $h(1) = \infty$.
On the other hand, it is interesting to observe that there is no function
$h\colon S \to \mathbb{R}$ satisfying

\[ e^{\lambda J^*(\lambda,2) + \lambda h(2)} \ge e^{\lambda C(2,0)} \sum_{y} p_{2y}(0)\, e^{\lambda h(y)}; \tag{2.8} \]

indeed, the left-hand side of this inequality is $e^{2\lambda}\rho^2 e^{\lambda h(2)}$, whereas the
right-hand side satisfies $e^{2\lambda}[p_{22}(0) e^{\lambda h(2)} + p_{20}(0) e^{\lambda h(0)}] > e^{2\lambda} p_{22}(0) e^{\lambda h(2)} = e^{2\lambda}\rho^2 e^{\lambda h(2)}$.
When the risk sensitivity coefficient satisfies $e^{\lambda}\rho = 1$, similar
calculations yield that (i) $J^*(\lambda,\cdot) = 0 = \gamma$, (ii) the relative value function $h$
in (2.6) is $\infty$ at $x = 1$ and $2$, and (iii) inequality (2.8) is not satisfied by
any function $h\colon S \to \mathbb{R}$; in particular, even in this case in which the
optimal average cost is constant, the optimality equation (2.5) does not have
a solution. Finally, if $\lambda$ satisfies $e^{\lambda}\rho < 1$, which in this example is the
precise meaning of “if $\lambda$ is sufficiently small,” the optimal value function is
identically $0 = \gamma$, the relative value function in (2.6) is finite, and the pair
$(\gamma, h(\cdot))$ satisfies the optimality equation (2.5); see [3] or [14] for these latter
assertions.

The characterization theorem. The main result of this work, which is
formally stated as Theorem 3.5 in the following section, provides a
characterization of $J^*(\lambda,\cdot)$ covering the diversity of possible behaviors illustrated
in Example 2.2. For each $\lambda > 0$, this theorem determines the optimal value
function in terms of a class of functions $\mathcal{G}$, and establishes that $J^*(\lambda,\cdot)$ is the
infimum of such a family. This conclusion is similar to the characterization
of the optimal (risk-neutral) total expected cost $V^*$ for MDPs with
nonnegative cost function; in this latter case, $V^*$ is the infimum of all nonnegative
functions $W$ defined on the state space and satisfying $W \ge \mathcal{T}W$, where $\mathcal{T}$ is
the corresponding dynamic programming operator; see [18] for details.
However, for the risk-sensitive average criterion in this work, the construction of
the family $\mathcal{G}$ involves two conditions, resembling the two equations that
characterize the optimal risk-neutral average cost in multichain MDPs, for which
the optimal performance index is not necessarily constant (see, e.g.,
Chapter 9 in [18]). The first restriction imposed on the members of $\mathcal{G}$ reflects
a fundamental property of the risk-sensitive average index, namely, if the
system is driven by a “good” policy, then $\{J^*(\lambda,X_t)\}$ is nonincreasing for
almost all sample trajectories. This property is a consequence of Lemma 3.1
in the following section, establishing that the optimal value function satisfies
a min-max equation, and the first condition imposed on the members of $\mathcal{G}$ is
to satisfy such an equality. The second condition on a function $g \in \mathcal{G}$ is
motivated by the optimality equation that, at least formally, is associated with
this optimal control problem. This condition guarantees that $g$ is really an
upper bound of $J^*(\lambda,\cdot)$; it was also used in [6] to analyze the uncontrolled
case, and requires the existence of a (deviation) function $h\colon S \to \mathbb{R}$ such
that the pair $(g(\cdot),h(\cdot))$ satisfies (3.4), which is analogous to the condition
$W \ge \mathcal{T}W$ mentioned above.

Outline of the argument. As might be expected from the diversity
illustrated in Example 2.2, characterizing $J^*(\lambda,\cdot)$ for arbitrary $\lambda > 0$ is a
somewhat technical task, so that it is convenient to give a brief outline of
the argument used to achieve this goal. In Section 3 the basic min-max
equation satisfied by $J^*(\lambda,\cdot)$ is established, and then the family of functions
$\mathcal{G}$ is introduced. Next, it is shown that the optimal $\lambda$-sensitive average cost is
a lower bound of $\mathcal{G}$, and the characterization result of $J^*(\lambda,\cdot)$ as the infimum
of $\mathcal{G}$ is stated as Theorem 3.5. As will be noted below, in general $J^*(\lambda,\cdot)$
does not belong to $\mathcal{G}$, but the strategy to establish Theorem 3.5 consists in
showing that, for each $\alpha \in (0,1)$, the function $g(\cdot) = \alpha J^*(\lambda,\cdot) + (1 - \alpha)\|C\|$
lies in $\mathcal{G}$, from which Theorem 3.5 follows immediately. The main difficulty
in establishing this inclusion is to prove that there exists a deviation
function $h\colon S \to \mathbb{R}$ such that the second condition in the definition of the family $\mathcal{G}$ is
satisfied. In Definition 4.1 a candidate $h$ for the deviation function for the
function $g$ above is introduced, and from that point onward, the effort is
mainly dedicated to establishing that $h(\cdot)$ is a finite function, a fact that is
proved in two steps: In Theorem 4.4 it is shown that $h$ is finite at the points
$x$ where the optimal value function is minimized, whereas in Theorem 5.1
this conclusion is extended to the whole state space. The argument in this
part relies heavily on the following property: Under an $\varepsilon$-optimal policy with
$\varepsilon > 0$ small enough, along almost all trajectories the optimal value function
is dominated by its value at the initial state. Section 6 concerns a last
technical point on the function $h$ introduced in Definition 4.1, namely, that $h(z)$
is nonpositive, where $z$ is as in Assumption 2.1. After the preliminaries in
Sections 4-6, Theorem 3.5 is finally proved in Section 7.

Before leaving this section, it is convenient to point out the following 
observation. 


Remark 2.3. (i) Given $\varepsilon > 0$, an $\varepsilon$-optimal policy exists. Indeed, from
the definition of $J^*(\lambda,\cdot)$ in (2.3), it follows that for each $x \in S$ there exists
a policy $\pi^x \in \mathcal{P}$ which is $\varepsilon$-optimal at $x$, that is,

\[ J(\lambda,\pi^x,x) \le J^*(\lambda,x) + \varepsilon, \]

and a new policy $\pi$ can be defined as follows: For each $t \in \mathbb{N}$ and $h_t \in \mathbb{H}_t$,
$\pi_t(\cdot \mid h_t) = \pi_t^{x_0}(\cdot \mid h_t)$. A controller driving the system according to $\pi$ first
observes the initial state, and then picks the actions according to $\pi^x$ if
$X_0 = x$ is observed. From this construction it follows that the equality

\[ E_x^{\pi}\left[ \exp\left\{ \lambda \sum_{t=0}^{n-1} C(X_t,A_t) \right\} \right] = E_x^{\pi^x}\left[ \exp\left\{ \lambda \sum_{t=0}^{n-1} C(X_t,A_t) \right\} \right] \]

is always valid, and then (2.1) and (2.2) together yield that $J(\lambda,\pi,x) =
J(\lambda,\pi^x,x) \le J^*(\lambda,x) + \varepsilon$ for every state $x$, so that $\pi$ is $\varepsilon$-optimal.

(ii) From (2.1)-(2.3) it is not difficult to see that $-\|C\| \le J^*(\lambda,\cdot) \le \|C\|$.


3. Min-max equation and main result. According to the program
outlined above, in this section the characterization result for the optimal value
function is stated. First, it is shown in the next lemma that the fundamental
min-max equation is satisfied by the optimal value function, and such an
equality is used as one of the requirements in the definition of the family of
functions $\mathcal{G}$ involved in the characterization of $J^*(\lambda,\cdot)$.

Lemma 3.1. For each $\lambda > 0$, the function $J^*(\lambda,\cdot)$ in (2.3) satisfies the
following min-max equation:

\[ J^*(\lambda,x) = \min_{a \in A(x)} \max\{ J^*(\lambda,y) \mid p_{xy}(a) > 0 \}, \qquad x \in S. \]

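Proof. Let $(x,a) \in \mathbb{K}$ and $\varepsilon > 0$ be arbitrary but fixed, and let $\pi \in \mathcal{P}$ be
an $\varepsilon$-optimal policy (see Remark 2.3). Next, select a policy $f \in \mathbb{F}$ satisfying
$f(x) = a$, and define the new policy $\tilde\pi \in \mathcal{P}$ as follows: $\tilde\pi_0(\{f(x_0)\} \mid x_0) = 1$
for each $x_0 \in S$, whereas for each $t \in \mathbb{N}$ and $h_{t+1} \in \mathbb{H}_{t+1}$,

\[ \tilde\pi_{t+1}(\cdot \mid h_{t+1}) = \pi_t(\cdot \mid x_1, a_1, \ldots, x_{t+1}). \]

When the system is driven by $\tilde\pi$, the action applied at time zero is selected
using $f$, whereas from time 1 onwards, the controls are picked using the
$\varepsilon$-optimal policy $\pi$ as if the decision process had started again at time 1.
The Markov property and (2.1) together yield that for every positive integer
$n$,

\[ \begin{aligned}
e^{\lambda J_{n+1}(\lambda,\tilde\pi,x)}
&= E_x^{\tilde\pi}\left[ \exp\left\{ \lambda \sum_{t=0}^{n} C(X_t,A_t) \right\} \right] \\
&= e^{\lambda C(x,f(x))} \sum_{y} p_{xy}(f(x))\, E_y^{\pi}\left[ \exp\left\{ \lambda \sum_{t=0}^{n-1} C(X_t,A_t) \right\} \right] \\
&= e^{\lambda C(x,a)} \sum_{y} p_{xy}(a)\, e^{\lambda J_n(\lambda,\pi,y)},
\end{aligned} \]

so that

\[ \frac{J_{n+1}(\lambda,\tilde\pi,x)}{n+1} = \frac{C(x,a)}{n+1} + \frac{n}{n+1} \log\left( \left[ \sum_{y} p_{xy}(a)\, e^{\lambda J_n(\lambda,\pi,y)} \right]^{1/(\lambda n)} \right). \tag{3.1} \]

On the other hand, since $\pi$ is $\varepsilon$-optimal and $S$ is finite, it follows that for
some $n_0 \in \mathbb{N}$,

\[ J_n(\lambda,\pi,\cdot) \le n\,(J^*(\lambda,\cdot) + \varepsilon), \qquad n \ge n_0. \]

Therefore, $\sum_y p_{xy}(a)\, e^{\lambda J_n(\lambda,\pi,y)} \le \sum_y p_{xy}(a)\, e^{\lambda n (J^*(\lambda,y) + \varepsilon)}$ when $n \ge n_0$, and
it follows that

\[ \begin{aligned}
\limsup_{n \to \infty} \left[ \sum_{y} p_{xy}(a)\, e^{\lambda J_n(\lambda,\pi,y)} \right]^{1/(\lambda n)}
&\le \limsup_{n \to \infty} \left[ \sum_{y} p_{xy}(a)\, e^{\lambda n (J^*(\lambda,y) + \varepsilon)} \right]^{1/(\lambda n)} \\
&= \max\{ e^{J^*(\lambda,y) + \varepsilon} \mid p_{xy}(a) > 0 \} \\
&= e^{\max\{ J^*(\lambda,y) + \varepsilon \mid p_{xy}(a) > 0 \}},
\end{aligned} \]

where the second equality is due to the fact that the exponential function is
increasing. Combining this with (2.2), after taking limit superior as $n$ goes
to $\infty$ in (3.1) it follows that

\[ J^*(\lambda,x) \le J(\lambda,\tilde\pi,x) \le \max\{ J^*(\lambda,y) + \varepsilon \mid p_{xy}(a) > 0 \}; \]

see (2.3) for the first inequality. Recalling that $\varepsilon > 0$ is arbitrary, this yields

\[ J^*(\lambda,x) \le \max\{ J^*(\lambda,y) \mid p_{xy}(a) > 0 \}, \]

a relation that, since $(x,a) \in \mathbb{K}$ is arbitrary, implies

\[ J^*(\lambda,x) \le \min_{a \in A(x)} \max\{ J^*(\lambda,y) \mid p_{xy}(a) > 0 \}, \qquad x \in S. \tag{3.2} \]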

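To establish the reverse inequality let $x \in S$ and $\pi \in \mathcal{P}$ be arbitrary. Select
$b \in A(x)$ satisfying $\pi_0(\{b\} \mid x) > 0$, and let $y \in S$ be such that $p_{xy}(b) > 0$.
Combining (2.1) with the Markov property, it follows that for every positive
integer $n$,

\[ \begin{aligned}
e^{\lambda J_{n+1}(\lambda,\pi,x)}
&= E_x^{\pi}\left[ \exp\left\{ \lambda \sum_{t=0}^{n} C(X_t,A_t) \right\} \right] \\
&\ge E_x^{\pi}\left[ \exp\left\{ \lambda \sum_{t=0}^{n} C(X_t,A_t) \right\} I[A_0 = b, X_1 = y] \right] \\
&= \pi_0(\{b\} \mid x)\, p_{xy}(b)\, e^{\lambda C(x,b)}\, E_y^{\delta}\left[ \exp\left\{ \lambda \sum_{t=0}^{n-1} C(X_t,A_t) \right\} \right],
\end{aligned} \]

where the “shifted” policy $\delta$ is defined as follows: For every $t \in \mathbb{N}$ and $h_t \in \mathbb{H}_t$,
$\delta_t(\cdot \mid h_t) = \pi_{t+1}(\cdot \mid x, b, h_t)$. Therefore,

\[ \frac{J_{n+1}(\lambda,\pi,x)}{n+1} \ge \frac{1}{\lambda(n+1)} \log\bigl( \pi_0(\{b\} \mid x)\, p_{xy}(b)\, e^{\lambda C(x,b)} \bigr) + \frac{n}{n+1} \cdot \frac{J_n(\lambda,\delta,y)}{n}, \]

and taking limit superior as $n$ goes to $\infty$, it follows that

\[ J(\lambda,\pi,x) \ge J(\lambda,\delta,y) \ge J^*(\lambda,y); \]

see (2.2) and (2.3). Since the state $y$ satisfying $p_{xy}(b) > 0$ is arbitrary, this
implies that

\[ J(\lambda,\pi,x) \ge \max\{ J^*(\lambda,y) \mid p_{xy}(b) > 0 \}, \]

and then

\[ J(\lambda,\pi,x) \ge \min_{a \in A(x)} \max\{ J^*(\lambda,y) \mid p_{xy}(a) > 0 \}. \]

Since this holds for every $\pi \in \mathcal{P}$ and $x \in S$, (2.3) yields that

\[ J^*(\lambda,x) \ge \min_{a \in A(x)} \max\{ J^*(\lambda,y) \mid p_{xy}(a) > 0 \}, \qquad x \in S, \]

and the result follows combining this inequality with (3.2). □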
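The min-max equation of Lemma 3.1 is easy to verify on the model of Example 2.2. The following Python sketch, given only as an illustration, evaluates the right-hand side $\min_{a \in A(x)} \max\{J^*(\lambda,y) \mid p_{xy}(a) > 0\}$ for the optimal values computed in that example under condition (2.7). The helper name `minmax_rhs`, the dictionary layout of the model and the values $\rho = 0.8$, $\lambda = 1$ are assumptions made for the illustration.

```python
import math

def minmax_rhs(J, actions, p):
    """Right-hand side of the min-max equation: min over a of max{J[y] : p[(x, a)][y] > 0}."""
    return [min(max(J[y] for y, q in enumerate(p[(x, a)]) if q > 0.0)
                for a in actions[x])
            for x in range(len(J))]

rho, lam = 0.8, 1.0                       # illustrative values with exp(lam) * rho > 1
actions = {0: [0], 1: [0, 1], 2: [0]}     # admissible action sets A(x) of Example 2.2
p = {(0, 0): [1.0, 0.0, 0.0],
     (1, 0): [1.0 - rho, rho, 0.0],
     (1, 1): [0.0, 0.0, 1.0],
     (2, 0): [1.0 - rho**2, 0.0, rho**2]}
c = 1.0 + math.log(rho) / lam             # J*(lam, 1) from Example 2.2
J_star = [0.0, c, 2.0 * c]
print(minmax_rhs(J_star, actions, p))
```

At state 1 the minimum is attained by action 0 only, since action 1 leads to state 2, where the optimal average cost is larger.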


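Definition 3.2. The class $\mathcal{G}$ consists of all functions $g \in \mathcal{B}(S)$ satisfying
the following conditions:

(i) For each $x \in S$,

\[ g(x) = \min_{a \in A(x)} \max\{ g(y) \mid p_{xy}(a) > 0 \}. \tag{3.3} \]

(ii) There exists a function $h \in \mathcal{B}(S)$, possibly depending on $g$, such that

\[ e^{\lambda g(x) + \lambda h(x)} \ge \min_{a \in B_g(x)} \left[ e^{\lambda C(x,a)} \sum_{y} p_{xy}(a)\, e^{\lambda h(y)} \right], \qquad x \in S, \tag{3.4} \]

where

\[ B_g(x) := \{ a \in A(x) \mid g(x) = \max\{ g(y) \mid p_{xy}(a) > 0 \} \}; \tag{3.5} \]

a function $h(\cdot)$ satisfying (3.4) will be referred to as a deviation function
associated to $g(\cdot)$.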


Remark 3.3. (i) Given $g \in \mathcal{B}(S)$ satisfying (3.3), the finiteness of the
action sets $A(x)$ ensures that each set $B_g(x)$ is nonempty.

(ii) The family $\mathcal{G}$ is nonempty. In fact, if $g(\cdot) = \|C\|$, then $g \in \mathcal{G}$, since (3.3)
is clearly satisfied by this function, whereas (3.4) holds with $h(\cdot) = 0$.
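The two conditions in Definition 3.2 can be tested mechanically for a given pair $(g,h)$. The following Python sketch is a minimal illustration on the model of Example 2.2; the function name `in_class_G`, the tolerance `tol`, the dictionary layout of the model and the values $\rho = 0.8$, $\lambda = 1$ are assumptions introduced here and are not part of the paper.

```python
import math

def in_class_G(g, h, lam, actions, p, C, tol=1e-12):
    """Check conditions (3.3) and (3.4) of Definition 3.2 for a candidate pair (g, h)."""
    for x in range(len(g)):
        # Largest value of g over the states reachable in one step under each action.
        worst = {a: max(g[y] for y, q in enumerate(p[(x, a)]) if q > 0.0)
                 for a in actions[x]}
        if abs(g[x] - min(worst.values())) > tol:        # min-max equation (3.3)
            return False
        B_g = [a for a in actions[x] if abs(worst[a] - g[x]) <= tol]  # set (3.5)
        rhs = min(math.exp(lam * C[(x, a)])
                  * sum(q * math.exp(lam * h[y]) for y, q in enumerate(p[(x, a)]))
                  for a in B_g)
        if math.exp(lam * (g[x] + h[x])) < rhs - tol:    # inequality (3.4)
            return False
    return True

rho, lam = 0.8, 1.0                       # illustrative values with exp(lam) * rho > 1
actions = {0: [0], 1: [0, 1], 2: [0]}
p = {(0, 0): [1.0, 0.0, 0.0],
     (1, 0): [1.0 - rho, rho, 0.0],
     (1, 1): [0.0, 0.0, 1.0],
     (2, 0): [1.0 - rho**2, 0.0, rho**2]}
C = {(x, a): float(x) for (x, a) in p}    # C(x, a) = x
norm_C = max(abs(v) for v in C.values())  # maximum norm of the cost function
print(in_class_G([norm_C] * 3, [0.0] * 3, lam, actions, p, C))
```

The call above checks the pair $g(\cdot) = \|C\|$, $h(\cdot) = 0$ of Remark 3.3(ii). Replacing $g$ by the optimal values $(0, c, 2c)$ with $c = 1 + \log(\rho)/\lambda$ and keeping $h = 0$ makes the check fail at state 1; this only rules out that particular $h$ and is not a substitute for an argument covering every $h$.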


The following lemma shows that the optimal value function is dominated
by each member of $\mathcal{G}$.

Lemma 3.4. (i) Suppose that $g\colon S \to \mathbb{R}$ satisfies (3.3) and for each $x \in S$
let $B_g(x) \subset A(x)$ be as in (3.5). Given $x \in S$, assume that the policy $\delta \in \mathcal{P}$
satisfies $P_x^\delta[A_r \in B_g(X_r)] = 1$ for every $r \in \mathbb{N}$. In this case, when $x$
is the initial state and the system is driven by $\delta$, the process $\{g(X_t)\}$ is
nonincreasing almost surely. More precisely, for each $n \in \mathbb{N}$,

\[ g(X_{n+1}) \le g(X_n) \le \cdots \le g(X_0) = g(x), \qquad P_x^\delta\text{-a.s.} \]

Consequently,
(ii) Every $g \in \mathcal{G}$ is an upper bound of the optimal value function $J^*(\lambda,\cdot)$.


Proof. (i) Let $t \in \mathbb{N}$ be fixed, and suppose that $w, y \in S$ satisfy
$P_x^\delta[X_t = w, X_{t+1} = y] > 0$. In this case there exists $a \in B_g(w)$ such that
$P_x^\delta[X_t = w, A_t = a, X_{t+1} = y] > 0$, since $P_x^\delta[A_t \in B_g(X_t)] = 1$, and then

\[ \begin{aligned}
0 &< P_x^\delta[X_t = w, A_t = a, X_{t+1} = y] \\
&= P_x^\delta[X_{t+1} = y \mid X_t = w, A_t = a]\, P_x^\delta[X_t = w, A_t = a] \\
&= p_{wy}(a)\, P_x^\delta[X_t = w, A_t = a],
\end{aligned} \]

where the second equality is due to the Markov property; therefore,

\[ p_{wy}(a) > 0. \]

On the other hand, from (3.5), the inclusion $a \in B_g(w)$ yields that

\[ g(w) = \max\{ g(z) \mid p_{wz}(a) > 0 \}, \]

and combining this with the above inequality, it follows that $g(w) \ge g(y)$.
In short, it has been shown that

\[ P_x^\delta[X_t = w, X_{t+1} = y] > 0 \implies g(w) \ge g(y), \]

and it follows that $P_x^\delta[g(X_{t+1}) \le g(X_t)] = 1$. Since $t \in \mathbb{N}$ is arbitrary, this
yields that $P_x^\delta[g(X_{n+1}) \le g(X_n) \le \cdots \le g(X_0)] = 1$ for each $n \in \mathbb{N}$.

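(ii) Let $g \in \mathcal{G}$ be arbitrary, and select $h \in \mathcal{B}(S)$ as in (3.4). For each $x \in S$,
let $f(x) \in B_g(x)$ be a minimizer of the term within brackets in (3.4), so that
for every $x \in S$,

\[ e^{\lambda g(x) + \lambda h(x)} \ge e^{\lambda C(x,f(x))} \sum_{y} p_{xy}(f(x))\, e^{\lambda h(y)}, \]

which is equivalent to $e^{\lambda h(x)} \ge E_x^f\bigl[ e^{\lambda[C(X_0,A_0) - g(X_0)]} e^{\lambda h(X_1)} \bigr]$. From this point,
an induction argument yields that

\[ e^{\lambda h(x)} \ge E_x^f\left[ \exp\left\{ \lambda \sum_{t=0}^{n} [C(X_t,A_t) - g(X_t)] \right\} e^{\lambda h(X_{n+1})} \right], \qquad x \in S,\ n \in \mathbb{N}. \]

Observe that, by part (i), under the action of policy $f$ the inequalities

\[ g(X_n) \le g(X_{n-1}) \le \cdots \le g(X_1) \le g(X_0) \]

hold with probability 1 regardless of the initial state. Therefore,
$\sum_{t=0}^{n} [C(X_t,A_t) - g(X_t)] \ge \sum_{t=0}^{n} C(X_t,A_t) - (n+1) g(X_0)$, so that for each $n \in \mathbb{N}$ and $x \in S$,

\[ \begin{aligned}
e^{\lambda h(x)}
&\ge E_x^f\left[ \exp\left\{ \lambda \left[ \sum_{t=0}^{n} C(X_t,A_t) - (n+1) g(X_0) \right] \right\} e^{\lambda h(X_{n+1})} \right] \\
&\ge e^{-\lambda (n+1) g(x) - \lambda \|h\|}\, E_x^f\left[ \exp\left\{ \lambda \sum_{t=0}^{n} C(X_t,A_t) \right\} \right] \\
&= e^{-\lambda (n+1) g(x) - \lambda \|h\|}\, e^{\lambda J_{n+1}(\lambda,f,x)}.
\end{aligned} \]

Hence,

\[ g(x) + \frac{\|h\| + h(x)}{n+1} \ge \frac{J_{n+1}(\lambda,f,x)}{n+1}, \]

and taking limit superior as $n$ goes to $\infty$, this yields that $g(x) \ge J(\lambda,f,x) \ge
J^*(\lambda,x)$; see (2.2) and (2.3). Since $x \in S$ is arbitrary, it follows that $g(\cdot) \ge
J^*(\lambda,\cdot)$. □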


According to this result, the function $J^*(\lambda,\cdot)$ is a lower bound for
each member of $\mathcal{G}$. On the other hand, although $J^*(\lambda,\cdot)$ satisfies (3.3), by
Lemma 3.1, in general this optimal value function does not belong to $\mathcal{G}$.
Indeed, in the context of Example 2.2, it was shown that when $e^{\lambda}\rho > 1$ there
is no function $h$ such that (2.8) is satisfied, and this implies that the
second part of Definition 3.2 fails for the function $J^*(\lambda,\cdot)$. However, under
Assumption 2.1, the main result of this work asserts that $J^*(\lambda,\cdot)$ is the
largest lower bound of $\mathcal{G}$.


Theorem 3.5. Under Assumption 2.1, for each $x \in S$,

\[ J^*(\lambda,x) = \inf_{g \in \mathcal{G}} g(x). \]

This result extends Theorem 2.2 in [6] where the uncontrolled case was
analyzed. The somewhat technical proof of this theorem will be given in
Section 7 after establishing the necessary technical preliminaries in the
following three sections. Essentially, although it cannot be ensured that the
optimal value function is a member of $\mathcal{G}$, the idea is to prove that, for each
$\alpha \in (0,1)$, the function $g$ specified by

\[ g(\cdot) := \alpha J^*(\lambda,\cdot) + (1 - \alpha)\|C\| \tag{3.6} \]

lies in $\mathcal{G}$, a fact that immediately yields Theorem 3.5. Using Lemma 3.1 it
is not difficult to see that this function $g$ satisfies the min-max equation
(3.3) and then, to establish the inclusion $g \in \mathcal{G}$ it is sufficient to show that
there exists a deviation function $h \in \mathcal{B}(S)$ associated to $g$, so that the pair
$(g,h)$ satisfies (3.4). The proof of this existence result requires an important
technical effort that is presented in the following three sections. Throughout
the remainder Assumption 2.1 is supposed to hold even without explicit
reference, and $\alpha \in (0,1)$ is arbitrary but fixed.

4. Deviation function. In this section a candidate $h(\cdot)$ for the deviation
function of the function $g$ in (3.6) is introduced and, as already mentioned, a
major objective is to show that such a function is finite. Although this goal is
finally achieved later, the main result of this section, stated as Theorem 4.4,
is a first step in this direction.


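Definition 4.1. (i) For each $x \in S$, define $B^*(x) := B_{J^*(\lambda,\cdot)}(x)$; see
(3.5).

(ii) The class $\mathcal{P}^*$ consists of all policies $\pi \in \mathcal{P}$ satisfying

\[ P_x^\pi[A_t \in B^*(X_t)] = 1, \qquad x \in S,\ t \in \mathbb{N}. \]

(iii) Given a fixed real number $\alpha \in (0,1)$, the corresponding deviation
function $h\colon S \to [-\infty,\infty]$ is defined as

\[ h(x) := \inf_{\pi \in \mathcal{P}^*} \frac{1}{\lambda} \log E_x^\pi\left[ \exp\left\{ \lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)] \right\} \right], \qquad x \in S, \tag{4.1} \]

where $T$ is the first positive passage time to state $z$; see (2.4).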
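As an illustration of (4.1), consider Example 2.2 under condition (2.7), where $J^*(\lambda,2) = 2(1 + \log(\rho)/\lambda)$ and the only admissible action at state 2 is $a = 0$, so that $T$ is geometric with $P[T = k] = \rho^{2(k-1)}(1-\rho^2)$ under every policy. The following computation is a sketch added for illustration, not a step of the original argument; it shows that the factor $\alpha \in (0,1)$ makes the expectation in (4.1) finite at $x = 2$.

```latex
% Example 2.2 with e^{\lambda}\rho > 1, initial state x = 2, T the hitting time of z = 0.
% Only action 0 is available at state 2, so every policy induces the same law for T.
\begin{align*}
\lambda\alpha \sum_{t=0}^{T-1} \bigl[ C(X_t, A_t) - J^*(\lambda, X_t) \bigr]
  &= -2\alpha T \log(\rho), \\
E_2^{\pi}\bigl[ \rho^{-2\alpha T} \bigr]
  &= \sum_{k=1}^{\infty} \rho^{-2\alpha k}\, \rho^{2(k-1)} (1-\rho^2)
   = \frac{(1-\rho^2)\,\rho^{-2\alpha}}{1 - \rho^{2(1-\alpha)}}, \\
h(2) &= \frac{1}{\lambda}
  \log\!\left( \frac{(1-\rho^2)\,\rho^{-2\alpha}}{1 - \rho^{2(1-\alpha)}} \right) < \infty
  \quad \text{for } \alpha \in (0,1).
\end{align*}
```

The denominator $1 - \rho^{2(1-\alpha)}$ vanishes as $\alpha$ increases to 1, which is consistent with the infinite value found for the relative value function (2.6) in Example 2.2.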


Notice that $a \in B^*(x)$ if and only if $J^*(\lambda,x) = \max\{ J^*(\lambda,y) \mid p_{xy}(a) > 0 \}$,
and that $\mathcal{P}^*$ is the class of policies for the MDP $(S, A, \{B^*(x)\}, C, P)$, which
is obtained by restricting the set of admissible actions at state $x$ to the
subset $B^*(x)$. On the other hand, observe that the factor $\lambda\alpha$ is used in the
exponential inside the expectation in (4.1); when $\alpha = 1$, it is not difficult to
see from Example 2.2 that $h(\cdot)$ may take on an infinite value at some points.
However, in the present case in which $\alpha$ lies in $(0,1)$, it will be proved that
$h(\cdot)$ is finite. The key tool in the argument leading to this goal is the following
consequence of Assumption 2.1.

Lemma 4.2. Under Assumption 2.1 there exist $\beta \in (0,1)$ and $\beta_0 > 0$
such that

\[ P_x^\pi[T > n] \le \beta_0 \beta^n, \qquad x \in S,\ \pi \in \mathcal{P},\ n \in \mathbb{N}. \]

A proof of this lemma can be seen, for instance, in [16] or [19]. Using this
result, it is now shown that $-\infty$ is not a value of the function $h(\cdot)$ in (4.1).


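Lemma 4.3. Let $\alpha \in (0,1)$ be fixed and let $\beta_0$ and $\beta$ be as in Lemma 4.2.

(i) There exists a positive constant $B_0$ such that for each $x \in S$ and
$\pi \in \mathcal{P}$,

\[ E_x^\pi\left[ \exp\left\{ \lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)] \right\} \right] \ge B_0. \]

Consequently:
(ii) $h(\cdot) > -\infty$.

(iii) Set

\[ \varepsilon_0 = -\frac{(1-\alpha)\log(\beta)}{\lambda\alpha}. \tag{4.2} \]

If $\pi \in \mathcal{P}$ is $\varepsilon$-optimal, where $\varepsilon \in (0,\varepsilon_0)$, and $x \in S$ is such that
$J^*(\lambda,x) = \gamma$, then

\[ E_x^\pi\left[ \exp\left\{ \lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma] \right\} \right] < \infty. \]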

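Proof. (i) Let $N_0 \in \mathbb{N}$ be such that $\beta_0 \beta^{N_0} \le 1/2$, and observe that
the inequality

\[ P_x^\pi[T \le N_0] \ge \frac{1}{2} \]

always holds by Lemma 4.2. On the other hand, using Remark 2.3(ii), it
follows that

\[ \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)] \ge -2\|C\|\,T, \]

so that for every $x \in S$ and $\pi \in \mathcal{P}$,

\[ E_x^\pi\left[ \exp\left\{ \lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)] \right\} \right]
\ge E_x^\pi\bigl[ e^{-2\lambda\alpha\|C\| T} \bigr]
\ge \sum_{k=0}^{N_0} P_x^\pi[T = k]\, e^{-2\lambda\alpha\|C\| k}, \]

and then

\[ E_x^\pi\left[ \exp\left\{ \lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)] \right\} \right]
\ge e^{-2\lambda\alpha\|C\| N_0}\, P_x^\pi[T \le N_0]
\ge \frac{e^{-2\lambda\|C\| N_0}}{2} =: B_0. \]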


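(ii) Combining part (i) with (4.1), it follows that $h(\cdot) \ge \frac{1}{\lambda}\log(B_0) > -\infty$.

(iii) Observe that Hölder's inequality implies

\[ \begin{aligned}
E_x^\pi\left[ \exp\left\{ \lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma] \right\} \right]
&= \sum_{n=1}^{\infty} E_x^\pi\left[ \exp\left\{ \lambda\alpha \sum_{t=0}^{n-1} [C(X_t,A_t) - \gamma] \right\} I[T = n] \right] \\
&\le \sum_{n=1}^{\infty} \left( E_x^\pi\left[ \exp\left\{ \lambda \sum_{t=0}^{n-1} [C(X_t,A_t) - \gamma] \right\} \right] \right)^{\alpha} \bigl( P_x^\pi[T = n] \bigr)^{1-\alpha},
\end{aligned} \]

and then (2.1) and Lemma 4.2, via $P_x^\pi[T = n] \le P_x^\pi[T > n-1] \le \beta_0 \beta^{n-1}$, together yield

\[ E_x^\pi\left[ \exp\left\{ \lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma] \right\} \right]
\le \left( \frac{\beta_0}{\beta} \right)^{1-\alpha} \sum_{n=1}^{\infty} e^{\lambda\alpha [J_n(\lambda,\pi,x) - n\gamma]}\, \beta^{n(1-\alpha)}. \]

Since $J^*(\lambda,x) = \gamma$ and $\pi$ is $\varepsilon$-optimal, it follows that $J_n(\lambda,\pi,x)/n < \gamma + \varepsilon$
when the positive integer $n$ is large enough, say $n \ge n_0$. Therefore,

\[ e^{\lambda\alpha [J_n(\lambda,\pi,x) - n\gamma]}\, \beta^{n(1-\alpha)} \le e^{\lambda\alpha\varepsilon n}\, \beta^{n(1-\alpha)} = \bigl( e^{\lambda\alpha\varepsilon} \beta^{1-\alpha} \bigr)^{n}, \qquad n \ge n_0. \]

On the other hand, since $0 < \varepsilon < \varepsilon_0$, (4.2) implies that $e^{\lambda\alpha\varepsilon}\beta^{1-\alpha} < 1$, so
that the last two displayed relations together yield that
$E_x^\pi\bigl[ \exp\bigl\{ \lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma] \bigr\} \bigr] < \infty$. □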


In contrast with the above argument used to establish the inequality
$h(\cdot) > -\infty$, the proof of the inequality $h(\cdot) < \infty$ is substantially more
technical. As a starting point, the main result of this section, stated in the
following theorem, establishes that the function $h(\cdot)$ is finite at the points where
the optimal value function attains its minimum value.

Theorem 4.4. Let $\gamma_0$ be the minimum value of $J^*(\lambda,\cdot)$. In this case:

(i) $J^*(\lambda,z) = \gamma_0$, where $z$ is as in Assumption 2.1.

(ii) If $J^*(\lambda,x) = \gamma_0$, then $h(x)$ is finite.

The proof of these results relies on the technical preliminaries in the
following two lemmas; the first one provides a bound for $\{J^*(\lambda,X_t)\}$ when the
system is driven by an $\varepsilon$-optimal policy.


Lemma 4.5. Let $\pi \in \mathcal{P}$, $x \in S$ and $r \in \mathbb{N}$ be arbitrary but fixed, and
suppose that the vector $h_r = (x_0, a_0, \ldots, x_{r-1}, a_{r-1}, x_r) \in \mathbb{H}_r$ satisfies

$P_x^\pi[I_r = h_r] > 0.$

In this case:

(i) $J(\lambda,\pi,x) \ge J(\lambda,\delta,x_r) \ge J^*(\lambda,x_r)$, where the shifted policy $\delta$ is given
by

(4.3)  $\delta_t(\cdot|h_t) = \pi_{t+r}(\cdot|x_0,a_0,\ldots,x_{r-1},a_{r-1},h_t), \quad t \in \mathbb{N},\ h_t \in \mathbb{H}_t.$

Consequently:

(ii) If $\pi$ is $\varepsilon$-optimal at $x$, then for each $m \in \mathbb{N}$,

$J^*(\lambda,X_m) \le J^*(\lambda,x) + \varepsilon, \quad P_x^\pi\text{-a.s.}$

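Proof. (i) Given an integer $n > r$, observe that

(4.4)  $E_x^\pi\left[\exp\left\{\lambda \sum_{t=0}^{n} C(X_t,A_t)\right\}\right] \ge e^{-\lambda r\|C\|} E_x^\pi\left[\exp\left\{\lambda \sum_{t=r}^{n} C(X_t,A_t)\right\} I[I_r = h_r]\right].$

On the other hand, an application of the Markov property yields that

$E_x^\pi\left[\exp\left\{\lambda \sum_{t=r}^{n} C(X_t,A_t)\right\} I[I_r = h_r] \,\Big|\, I_r\right] = I[I_r = h_r]\, E_{x_r}^\delta\left[\exp\left\{\lambda \sum_{t=0}^{n-r} C(X_t,A_t)\right\}\right],$

where policy $\delta$ is as in (4.3). Taking expectation with respect to $P_x^\pi$ in both
sides of this equality, it follows that

$E_x^\pi\left[\exp\left\{\lambda \sum_{t=r}^{n} C(X_t,A_t)\right\} I[I_r = h_r]\right] = P_x^\pi[I_r = h_r]\, E_{x_r}^\delta\left[\exp\left\{\lambda \sum_{t=0}^{n-r} C(X_t,A_t)\right\}\right],$

which combined with (4.4) leads to

$E_x^\pi\left[\exp\left\{\lambda \sum_{t=0}^{n} C(X_t,A_t)\right\}\right] \ge e^{-\lambda r\|C\|} P_x^\pi[I_r = h_r]\, E_{x_r}^\delta\left[\exp\left\{\lambda \sum_{t=0}^{n-r} C(X_t,A_t)\right\}\right].$

This inequality and (2.1) together imply that

$\frac{J_{n+1}(\lambda,\pi,x)}{n+1} \ge \frac{\log\left(e^{-\lambda r\|C\|} P_x^\pi[I_r = h_r]\right)}{\lambda(n+1)} + \frac{n-r+1}{n+1} \cdot \frac{J_{n-r+1}(\lambda,\delta,x_r)}{n-r+1},$

and taking limit superior as $n$ increases to $\infty$, this yields $J(\lambda,\pi,x) \ge J(\lambda,\delta,x_r) \ge J^*(\lambda,x_r)$; see (2.2).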

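(ii) Let $\pi \in \mathcal{P}$ be $\varepsilon$-optimal at $x$, and suppose that $P_x^\pi[X_m = y] > 0$. Observing that

$[X_m = y] = \bigcup_{h_m \in \mathbb{H}_m,\ x_m = y} [I_m = h_m],$

the finiteness of $\mathbb{H}_m$ implies that $P_x^\pi[I_m = h_m] > 0$ for some $h_m \in \mathbb{H}_m$ satisfying
$x_m = y$. In this case, part (i) yields that $J^*(\lambda,y) \le J(\lambda,\pi,x)$, so that
$J^*(\lambda,y) \le J^*(\lambda,x) + \varepsilon$, since $\pi$ is $\varepsilon$-optimal at $x$. In short,

$P_x^\pi[X_m = y] > 0 \implies J^*(\lambda,y) \le J^*(\lambda,x) + \varepsilon,$

and then $P_x^\pi[J^*(\lambda,X_m) \le J^*(\lambda,x) + \varepsilon] = 1$. □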
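The quantities $J_n(\lambda,\pi,x)$ appearing in the proof above can be computed explicitly when the policy is stationary. The sketch below assumes that (2.1) has the form $J_n(\lambda,\pi,x) = \frac{1}{\lambda}\log E_x^\pi[\exp\{\lambda\sum_{t=0}^{n-1} C(X_t,A_t)\}]$, which is how the proof uses it; the function name, the transition matrix and the cost vector are hypothetical illustrations, not objects from the paper.

```python
import math


def n_stage_risk_sensitive_cost(P, c, lam, n, x):
    """J_n(lam, f, x) for a fixed stationary policy f.

    P[i][j] is the transition probability from i to j under f and c[i] is the
    one-step cost at state i under f (both hypothetical inputs).  The recursion
    v_{k+1}(i) = exp(lam * c[i]) * sum_j P[i][j] * v_k(j), with v_0 = 1,
    gives v_n(x) = E_x[exp(lam * sum_{t<n} c(X_t))].
    """
    states = range(len(P))
    v = [1.0] * len(P)
    for _ in range(n):
        v = [math.exp(lam * c[i]) * sum(P[i][j] * v[j] for j in states)
             for i in states]
    return math.log(v[x]) / lam


# Hypothetical chain: state 1 is transient and jumps to the absorbing state 0.
P = [[1.0, 0.0], [1.0, 0.0]]
c = [1.0, 5.0]
print(n_stage_risk_sensitive_cost(P, c, 1.0, 10, 1) / 10)
```

For large $n$ or large $\lambda$ the exponentials may overflow in floating point; a log-domain recursion would be the natural remedy, but it is omitted here to keep the sketch short.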

In the following lemma it is shown that, if $\varepsilon > 0$ is small enough, the set of
minimizers of $J^*(\lambda,\cdot)$ is closed under the action of an $\varepsilon$-optimal policy and
that, “essentially,” such a policy belongs to the class $\mathcal{P}^*$ in Definition 4.1.
The precise statement of these facts involves the following notation.

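Definition 4.6. (i) Define the positive number $\varepsilon_1$ as follows:

(a) If $J^*(\lambda,\cdot)$ is constant, set $\varepsilon_1 := 1$.

(b) If $J^*(\lambda,\cdot)$ is not constant, let $\gamma_i$, $i = 0,1,\ldots,d$, be the different values
of $J^*(\lambda,\cdot)$ arranged in increasing order:

(4.5)  $\gamma_0 < \gamma_1 < \cdots < \gamma_d.$

In this case set

$\varepsilon_1 := \min\{\gamma_i - \gamma_{i-1} \mid i = 1,\ldots,d\}.$

(ii) The positive number $\bar{\varepsilon}$ is given by

$\bar{\varepsilon} = \min\{\varepsilon_0, \varepsilon_1\};$

see (4.2).

Remark 4.7. Observe that $J^*(\lambda,y) > J^*(\lambda,x)$ implies that $J^*(\lambda,y) \ge J^*(\lambda,x) + \varepsilon_1$. Therefore,

if $0 < \varepsilon < \bar{\varepsilon}\ (\le \varepsilon_1)$, then $J^*(\lambda,x) + \varepsilon \ge J^*(\lambda,y) \implies J^*(\lambda,x) \ge J^*(\lambda,y)$.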
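The constant in Definition 4.6(i) is simply the smallest gap between consecutive distinct values of the optimal value function. A minimal sketch of that computation follows; the dictionary representation of $J^*(\lambda,\cdot)$ and the function name are assumptions made only for illustration, and exact floating-point equality is used to detect repeated values.

```python
def smallest_value_gap(j_star):
    """Constant of Definition 4.6(i): 1 if J* is constant, else the minimum gap
    between consecutive distinct values of J*.

    j_star maps each state to its optimal value (hypothetical representation).
    """
    distinct = sorted(set(j_star.values()))
    if len(distinct) == 1:
        return 1.0
    return min(b - a for a, b in zip(distinct, distinct[1:]))


# Hypothetical optimal values on four states.
j_star = {"s0": 1.0, "s1": 1.5, "s2": 3.0, "s3": 1.5}
gap = smallest_value_gap(j_star)
print(gap)
```

With this gap in hand, Remark 4.7 says that adding any slack smaller than the gap to $J^*(\lambda,x)$ cannot change which states $y$ satisfy $J^*(\lambda,y) \le J^*(\lambda,x)$.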

Lemma 4.8. Let $x \in S$ be such that $J^*(\lambda,x) = \gamma_0 = \min_y J^*(\lambda,y)$, and
suppose that $\pi \in \mathcal{P}$ is $\varepsilon$-optimal at $x$, where $\varepsilon \in (0,\bar{\varepsilon})$. In this case, for each
$r \in \mathbb{N}$:

(i) $P_x^\pi[J^*(\lambda,X_r) = \gamma_0] = 1$.

(ii) $P_x^\pi[A_r \in B^*(X_r)] = 1$.

Moreover:

(iii) there exists a policy $\delta \in \mathcal{P}^*$ such that, when the initial state is $x$, the
distribution of the state-action process $\{(X_t,A_t)\}$ coincides under $\pi$ and $\delta$,
that is, $P_x^\pi = P_x^\delta$.

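Proof. (i) By Lemma 4.5, $P_x^\pi[J^*(\lambda,X_r) \le J^*(\lambda,x) + \varepsilon] = 1$, whereas
the inclusion $\varepsilon \in (0,\bar{\varepsilon})$ yields that $[J^*(\lambda,X_r) \le J^*(\lambda,x) + \varepsilon] \subset [J^*(\lambda,X_r) \le J^*(\lambda,x)]$, by Remark 4.7, so that

$P_x^\pi[J^*(\lambda,X_r) \le J^*(\lambda,x)] = 1.$

Since $J^*(\lambda,x) = \gamma_0$ is the minimum value of $J^*(\lambda,\cdot)$, it follows that $P_x^\pi[J^*(\lambda,X_r) = \gamma_0] = 1$.

(ii) Suppose that $P_x^\pi[A_r = a, X_r = w] > 0$. If $p_{wy}(a) > 0$, the Markov
property yields that $P_x^\pi[X_{r+1} = y \mid X_r = w, A_r = a] = p_{wy}(a) > 0$, so that

$0 < P_x^\pi[A_r = a, X_r = w]\, P_x^\pi[X_{r+1} = y \mid X_r = w, A_r = a]$

$= P_x^\pi[X_{r+1} = y, X_r = w, A_r = a]$

$\le P_x^\pi[X_r = w, X_{r+1} = y],$

and then part (i) yields that $J^*(\lambda,w) = J^*(\lambda,y) = \gamma_0$; since $y \in S$ satisfying
$p_{wy}(a) > 0$ is arbitrary, it follows that $J^*(\lambda,w) = \max\{J^*(\lambda,y) \mid p_{wy}(a) > 0\}$,
so that $a \in B^*(w)$; see Definition 4.1. Thus, $P_x^\pi[A_r = a, X_r = w] > 0 \implies a \in B^*(w)$, so that

$1 = \sum_{(w,a) \in \mathbb{K}} P_x^\pi[A_r = a, X_r = w] = \sum_{(w,a) \in \mathbb{K},\ a \in B^*(w)} P_x^\pi[A_r = a, X_r = w] = P_x^\pi[A_r \in B^*(X_r)].$

(iii) Take a fixed stationary policy $f$ satisfying

$f(y) \in B^*(y), \quad y \in S,$

and let the policy $\delta$ be determined as follows: For each $t \in \mathbb{N}$ and $h_t \in \mathbb{H}_t$,

(4.6)  $\delta_t(\cdot|h_t) := \pi_t(\cdot|h_t)$ if $x_0 = x$, $\quad \delta_t(\{f(x_t)\}|h_t) := 1$ when $x_0 \ne x$.

In this case, $\pi$ and $\delta$ coincide along trajectories starting at $x$, so that
$P_x^\pi = P_x^\delta$, and then $P_x^\delta[A_t \in B^*(X_t)] = P_x^\pi[A_t \in B^*(X_t)] = 1$ for each $t \in \mathbb{N}$.
Moreover, by the choice of $f$, $P_w^\delta[A_t \in B^*(X_t)] = 1$ always holds when $w \ne x$,
and it follows that $\delta \in \mathcal{P}^*$; see Definition 4.1. □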
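The action sets $B^*(w)$ used in parts (ii) and (iii) collect the actions under which the largest optimal value among the reachable successor states equals the value at the current state. The sketch below illustrates that membership test under the assumption that Definition 4.1(i) reads $B^*(w) = \{a : J^*(\lambda,w) = \max\{J^*(\lambda,y) \mid p_{wy}(a) > 0\}\}$, as the proof above indicates; the data layout and all names are hypothetical.

```python
def b_star(j_star, p, x):
    """Actions a at state x with J*(x) == max{J*(y) : p[x][a][y] > 0}.

    j_star: dict state -> optimal value; p: dict state -> action -> dict of
    successor probabilities (hypothetical layout).  Exact equality is used,
    so in floating point a tolerance may be needed.
    """
    selected = []
    for a, successors in p[x].items():
        reachable = [y for y, prob in successors.items() if prob > 0]
        if max(j_star[y] for y in reachable) == j_star[x]:
            selected.append(a)
    return selected


# Hypothetical two-state model: from state "u", action "stay" keeps the value,
# while action "risk" can reach a state with a larger optimal value.
j_star = {"u": 1.0, "v": 2.0}
p = {"u": {"stay": {"u": 1.0}, "risk": {"u": 0.5, "v": 0.5}},
     "v": {"go": {"u": 0.3, "v": 0.7}}}
print(b_star(j_star, p, "u"))
```

A stationary policy $f$ with $f(y) \in B^*(y)$, as chosen in part (iii), picks one element of this list at every state.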

After the above preliminaries, the proof of the main result of this section
can be established as follows.

Proof of Theorem 4.4. Let $x \in S$ be a minimizer of $J^*(\lambda,\cdot)$, so that
$J^*(\lambda,x) = \gamma_0$.

(i) By Lemma 4.2, there exists a positive integer $r$ such that $P_x^\pi[X_r = z] \ge P_x^\pi[T = r] > 0$, and then Lemma 4.8(i) yields that $J^*(\lambda,z) = \gamma_0$.

(ii) Let $\pi$ be an $\varepsilon$-optimal policy, where $\varepsilon < \bar{\varepsilon}\ (\le \varepsilon_0)$; see (4.2) and Definition 4.6.
In this case, using that $J^*(\lambda,x) = \gamma_0$, Lemma 4.3(iii) yields that
$E_x^\pi[\exp\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0]\}] < \infty$; since $P_x^\pi[J^*(\lambda,X_r) = \gamma_0] = 1$ holds
for every $r \in \mathbb{N}$, by Lemma 4.8(i), it follows that

$E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] < \infty.$

Now, using part (iii) in Lemma 4.8, select $\delta \in \mathcal{P}^*$ such that $P_x^\pi = P_x^\delta$, so that
the above inequality yields

$E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] < \infty,$

which, via Definition 4.1(iii), implies that $h(x) < \infty$; since $h(\cdot) > -\infty$, by
Lemma 4.3(ii), it follows that $h(x)$ is finite. □


5. Finiteness of the deviation function on the state space. Following the 
program outlined in Section 2, the objective of this section is to extend the 
finiteness result in Theorem 4.4(ii) to the whole state space. 

Theorem 5.1. For every $x \in S$, $h(x)$ is finite; see (4.1).

Since $h(\cdot) > -\infty$, by Lemma 4.3(ii), to establish this result it is sufficient
to show that $h(x) < \infty$ for every state $x \in S$. This latter inequality
holds when $x$ is a minimizer of the optimal value function $J^*(\lambda,\cdot)$, by Theorem 4.4(ii),
so that the deviation function is certainly finite when $J^*(\lambda,\cdot)$ is
constant. Thus, to prove Theorem 5.1 it must be shown that $h(\cdot) < \infty$ when
the optimal value function is not constant, and throughout the remainder
of the section it is supposed that $J^*(\lambda,\cdot)$ assumes values $\gamma_i$, $i = 0,1,\ldots,d$,
where $d \ge 1$, which are arranged in increasing order; see (4.5). With this in
mind, let the level set $G_i$ be given by

(5.1)  $G_i := \{x \in S \mid J^*(\lambda,x) = \gamma_i\}, \quad i = 0,\ldots,d.$

Notice that

(5.2)  $S = \bigcup_{i=0}^{d} G_i,$

and define the exit time of set $G_i$ by

(5.3)  $T_{G_i^c} := \min\{n > 0 \mid X_n \notin G_i\}, \quad i = 1,2,\ldots,d.$

Since the state $z$ in Assumption 2.1 is a minimizer of $J^*(\lambda,\cdot)$, by Theorem 4.4,
it follows that $z \notin G_i$ when $1 \le i \le d$, by (4.5) and (5.1). Therefore,
(5.3) and (2.4) together imply that

(5.4)  $T_{G_i^c} \le T$

and, via Lemma 4.2, this yields

(5.5)  $P_x^\pi[T_{G_i^c} = n] \le P_x^\pi[T_{G_i^c} \ge n] \le P_x^\pi[T \ge n] \le \beta_0\beta^n, \quad n \in \mathbb{N},\ i = 1,2,\ldots,d.$

The proof of Theorem 5.1, which parallels the ideas used to establish Theorem 4.4(ii),
relies on the following lemma extending conclusions in Lemmas
4.3(iii) and 4.8.


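Lemma 5.2. Let $\varepsilon \in (0,\bar{\varepsilon})$ and $x \in G_i$ be arbitrary but fixed, where $i > 0$,
and suppose that $\pi \in \mathcal{P}$ is $\varepsilon$-optimal at state $x$. In this case, assertions (i)-(iv)
below are valid.

(i) $E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] < \infty$.

(ii) $P_x^\pi[J^*(\lambda,X_r) \le \gamma_i] = 1$ for each $r \in \mathbb{N}$.

Consequently:

(iii) When $X_0 = x$ and the system is driven by $\pi$, the inclusion $A_t \in B^*(X_t)$
holds before $T_{G_i^c}$ with probability 1, that is,

$P_x^\pi\left[\bigcap_{t=0}^{T_{G_i^c}-1} [A_t \in B^*(X_t)]\right] = 1.$

(iv) $P_x^\pi\left[X_{T_{G_i^c}} \in \bigcup_{k=0}^{i-1} G_k\right] = 1.$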


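Proof. (i) The argument is along the lines in the proof of Lemma 4.3(iii).
First, notice that (5.3) yields that $X_t \in G_i$ if $1 \le t < T_{G_i^c}$, and then $J^*(\lambda,X_t) = \gamma_i$
for $0 \le t < T_{G_i^c}$ when $X_0 \in G_i$. Therefore, using that $x \in G_i$, via Hölder's
inequality it follows that

$E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] = E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - \gamma_i]\right\}\right]$

$= \sum_{n=1}^{\infty} E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{n-1} [C(X_t,A_t) - \gamma_i]\right\} I[T_{G_i^c} = n]\right]$

$\le \sum_{n=1}^{\infty} \left(E_x^\pi\left[\exp\left\{\lambda \sum_{t=0}^{n-1} [C(X_t,A_t) - \gamma_i]\right\}\right]\right)^{\alpha} \left(P_x^\pi[T_{G_i^c} = n]\right)^{1-\alpha},$

so that

(5.6)  $E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] \le \sum_{n=1}^{\infty} e^{\lambda\alpha[J_n(\lambda,\pi,x) - n\gamma_i]} \left(P_x^\pi[T_{G_i^c} = n]\right)^{1-\alpha};$

see (2.1). Since $J^*(\lambda,x) = \gamma_i$ and $\pi$ is $\varepsilon$-optimal at $x$, it follows that, for
some positive integer $n_0$, $J_n(\lambda,\pi,x) \le n(\gamma_i + \varepsilon)$ when $n \ge n_0$. This leads, via
(5.5), to

$e^{\lambda\alpha[J_n(\lambda,\pi,x) - n\gamma_i]} \left(P_x^\pi[T_{G_i^c} = n]\right)^{1-\alpha} \le \beta_0^{1-\alpha}\left(e^{\lambda\alpha\varepsilon}\beta^{1-\alpha}\right)^n, \quad n \ge n_0.$

Observing that the inclusion $\varepsilon \in (0,\bar{\varepsilon})$ yields that $e^{\lambda\alpha\varepsilon}\beta^{1-\alpha} < 1$ [see (4.2)
and Definition 4.6], the above-displayed inequality and (5.6) together imply
that $E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right]$ is finite.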

(ii) Let $r \in \mathbb{N}$ be arbitrary but fixed. Since policy $\pi$ is $\varepsilon$-optimal at $x$,
Lemma 4.5(ii) yields that $P_x^\pi[J^*(\lambda,X_r) \le J^*(\lambda,x) + \varepsilon] = 1$, and using the
inclusion $\varepsilon \in (0,\bar{\varepsilon})$, Remark 4.7 allows us to write $P_x^\pi[J^*(\lambda,X_r) \le J^*(\lambda,x)] = 1$.
The conclusion follows since $J^*(\lambda,x) = \gamma_i$.

(iii) Let $r, k \in \mathbb{N}$ be such that $r < k$, and suppose that the pair $(w,a) \in \mathbb{K}$
is such that

(5.7)  $P_x^\pi[X_r = w, A_r = a, T_{G_i^c} = k] > 0.$

Since $r < k$, from the definition of the exit time and the inclusion $x \in G_i$,
it follows that

$w \in G_i,$

and it will be shown that $a \in B^*(w)$. To achieve this goal, suppose that $y \in S$
satisfies $p_{wy}(a) > 0$ and observe that the Markov property yields $P_x^\pi[X_{r+1} = y \mid X_r = w, A_r = a] = p_{wy}(a) > 0$; since $P_x^\pi[X_r = w, A_r = a] > 0$, by (5.7), it
follows that

$P_x^\pi[X_{r+1} = y] \ge P_x^\pi[X_{r+1} = y, X_r = w, A_r = a]$

$= P_x^\pi[X_{r+1} = y \mid X_r = w, A_r = a]\, P_x^\pi[X_r = w, A_r = a] > 0.$

Therefore, part (ii) yields that $J^*(\lambda,y) \le \gamma_i$. Since $y \in S$ satisfying $p_{wy}(a) > 0$
is arbitrary, it follows that

$\max\{J^*(\lambda,y) \mid p_{wy}(a) > 0\} \le \gamma_i = J^*(\lambda,w),$

where the inclusion $w \in G_i$ was used to set the equality. Then

$\max\{J^*(\lambda,y) \mid p_{wy}(a) > 0\} = J^*(\lambda,w),$

by Lemma 3.1, so that $a \in B^*(w)$; see Definition 4.1(i). In short, when $r < k$,

$P_x^\pi[X_r = w, A_r = a, T_{G_i^c} = k] > 0 \implies a \in B^*(w),$

and it follows that

$P_x^\pi[T_{G_i^c} = k] = \sum_{(w,a) \in \mathbb{K}} P_x^\pi[X_r = w, A_r = a, T_{G_i^c} = k] = \sum_{(w,a) \in \mathbb{K},\ a \in B^*(w)} P_x^\pi[X_r = w, A_r = a, T_{G_i^c} = k],$

and then

$P_x^\pi[T_{G_i^c} = k] = P_x^\pi\left[[A_r \in B^*(X_r)] \cap [T_{G_i^c} = k]\right].$

Since this equality holds whenever $r < k$, it follows that

$P_x^\pi[T_{G_i^c} = k] = P_x^\pi\left[\bigcap_{r=0}^{k-1} [A_r \in B^*(X_r)] \cap [T_{G_i^c} = k]\right] = P_x^\pi\left[\bigcap_{r=0}^{T_{G_i^c}-1} [A_r \in B^*(X_r)] \cap [T_{G_i^c} = k]\right].$

Summing up over all positive integers $k$, this yields

$P_x^\pi[T_{G_i^c} < \infty] = P_x^\pi\left[\bigcap_{r=0}^{T_{G_i^c}-1} [A_r \in B^*(X_r)] \cap [T_{G_i^c} < \infty]\right],$

and the conclusion follows since, by (5.5), $P_x^\pi[T_{G_i^c} < \infty] = 1$.

(iv) Notice that (4.5), (5.1) and part (ii) together yield that, for each
positive integer $r$,

$P_x^\pi\left[X_r \in \bigcup_{k=0}^{i} G_k\right] = 1.$

On the other hand, from (5.3) it follows that $X_r \notin G_i$ on the event $[T_{G_i^c} = r]$,
so that the above displayed equation implies that

$P_x^\pi\left[T_{G_i^c} = r,\ X_{T_{G_i^c}} \in \bigcup_{k=0}^{i-1} G_k\right] = P_x^\pi\left[T_{G_i^c} = r,\ X_r \in \bigcup_{k=0}^{i-1} G_k\right] = P_x^\pi[T_{G_i^c} = r].$

Hence,

$P_x^\pi\left[T_{G_i^c} < \infty,\ X_{T_{G_i^c}} \in \bigcup_{k=0}^{i-1} G_k\right] = \sum_{r=1}^{\infty} P_x^\pi\left[T_{G_i^c} = r,\ X_{T_{G_i^c}} \in \bigcup_{k=0}^{i-1} G_k\right] = \sum_{r=1}^{\infty} P_x^\pi[T_{G_i^c} = r] = P_x^\pi[T_{G_i^c} < \infty],$

and the conclusion follows using that $T_{G_i^c}$ is finite with probability 1. □


Proof of Theorem 5.1. For each $m = 0,1,2,\ldots,d$, consider the following claim:

$(C_m)$  $h(x) < \infty$ for every $x \in G_m$.

Observe that the conclusion of Theorem 5.1 is equivalent to the truth
of every $(C_m)$, a fact that will be established by induction. To begin with,
notice that $(C_0)$ holds, by Theorem 4.4(ii); see (4.5) and (5.1). Assume now
that $i \le d$ is a positive integer such that $(C_m)$ holds when $m < i$. Under this
condition it will be proved that $(C_i)$ is valid. From this induction hypothesis,
the definition of $h(\cdot)$ in (4.1) yields that for each $y \in \bigcup_{j=0}^{i-1} G_j$ there exists a
policy $\delta^y$ such that

(5.8)  $\delta^y \in \mathcal{P}^*$ and $E_y^{\delta^y}\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] < \infty.$


Next, let $x \in G_i$ and $\varepsilon \in (0,\bar{\varepsilon})$ be arbitrary but fixed, and select a policy
$\pi \in \mathcal{P}$ which is $\varepsilon$-optimal at $x$. Given a stationary policy $f$ satisfying $f(w) \in B^*(w)$
for every $w \in S$, define the new policy $\delta$ as follows: For each $t \in \mathbb{N}$
and $h_t \in \mathbb{H}_t$:

(a) If $x_0 \ne x$, $\delta_t(\{f(x_t)\}|h_t) = 1$.

(b) If $x_0 = x$ and $x_k \in G_i$ for every $k = 1,2,\ldots,t$, then $\delta_t(\cdot|h_t) = \pi_t(\cdot|h_t)$.

(c) If $x_0 = x$ and, for some positive integer $r \le t$, $x_k \in G_i$ for every $k < r$
and $x_r \notin G_i$, then

$\delta_t(\cdot|h_t) = \delta^{x_r}_{t-r}(\cdot|x_r,\ldots,x_{t-1},a_{t-1},x_t)$ if $x_r \in \bigcup_{j=0}^{i-1} G_j$,

$\delta_t(\{f(x_t)\}|h_t) = 1$ when $x_r \in S \setminus \bigcup_{j=0}^{i} G_j$.

A controller driving the system via policy $\delta$ operates as follows: When the
initial state is $X_0 \ne x$, at each decision time the actions are selected according
to $f$. On the other hand, when $X_0 = x$, she uses policy $\pi$ to choose
actions while the system stays in $G_i$, but when the system first leaves $G_i$ at
time $T_{G_i^c} = k$, then the decision maker “forgets” the history observed before
time $k$ and, as if the process had started again, she switches to policy $\delta^{X_k}$ if
$X_k$ belongs to some set $G_m$ with $m < i$, or to policy $f$ otherwise. Now, let
$t \in \mathbb{N}$ be fixed. When the initial state is $x$, $\delta$ and $\pi$ coincide while the system
stays in $G_i$, by part (b), so that Lemma 5.2(iii) yields that

$A_t \in B^*(X_t)$

holds on $[T_{G_i^c} > t]$ $P_x^\delta$-a.s., whereas, by part (c), the choice of $f$ and the
inclusion in (5.8) imply that the above displayed relation also occurs $P_x^\delta$-a.s.
on the event $[T_{G_i^c} \le t]$. When $X_0 = w \ne x$, from the choice of $f$ and part (a)
in the above definition, it follows that $P_w^\delta[A_t \in B^*(X_t)] = 1$, so that

(5.9)  $\delta \in \mathcal{P}^*;$

see Definition 4.1. Moreover, using again that $\delta$ and $\pi$ coincide before $T_{G_i^c}$
when $x$ is the initial state, it follows that the event $[X_{T_{G_i^c}} \in \bigcup_{k=0}^{i-1} G_k]$ has
the same probability with respect to $P_x^\delta$ and $P_x^\pi$, whereas the expectation
of $\exp\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - \gamma_i]\}$ with respect to these measures coincides.
Thus, by parts (i) and (iv) of Lemma 5.2,

(5.10)  $P_x^\delta\left[X_{T_{G_i^c}} \in \bigcup_{k=0}^{i-1} G_k\right] = 1$

and

(5.11)  $E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] < \infty.$


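Next, it will be shown that

(5.12)  $E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] < \infty.$

To achieve this goal, notice that

$E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} I[T_{G_i^c} = T]\right]$

$= E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} I[T_{G_i^c} = T]\right]$

$\le E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right],$

and then

(5.13)  $E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} I[T_{G_i^c} = T]\right] < \infty,$

by (5.11). Next, observe that for each positive integer $r$,

$E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} I[r = T_{G_i^c} < T] \,\Big|\, I_r\right]$

$= \exp\left\{\lambda\alpha \sum_{t=0}^{r-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=r}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} I[r = T_{G_i^c} < T] \,\Big|\, I_r\right]$

$= \exp\left\{\lambda\alpha \sum_{t=0}^{r-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} I[r = T_{G_i^c} < T]\, E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=r}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} \,\Big|\, I_r\right],$

and that, on the event $[T_{G_i^c} = r]$, $X_r$ lies in $\bigcup_{j=0}^{i-1} G_j$ $P_x^\delta$-a.s., by (5.10). Thus,
part (c) in the definition of policy $\delta$ yields, via the Markov property, that
the following holds with probability 1 with respect to $P_x^\delta$:

$I[r = T_{G_i^c} < T]\, E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=r}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} \,\Big|\, I_r\right]$

$= I\left[r = T_{G_i^c} < T,\ X_r \in \bigcup_{j=0}^{i-1} G_j\right] E_{X_r}^{\delta^{X_r}}\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right]$

$\le M\, I[r = T_{G_i^c} < T],$

where $M := \max\left\{E_y^{\delta^y}\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] \,\Big|\, y \in \bigcup_{j=0}^{i-1} G_j\right\} < \infty$,
and the inequality is due to the induction hypothesis. Combining the last
two displayed relations, it follows that

$E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} I[r = T_{G_i^c} < T] \,\Big|\, I_r\right] \le M\, I[r = T_{G_i^c} < T] \exp\left\{\lambda\alpha \sum_{t=0}^{r-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}, \quad P_x^\delta\text{-a.s.},$

so that

$E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} I[r = T_{G_i^c} < T]\right] \le M\, E_x^\delta\left[I[r = T_{G_i^c} < T] \exp\left\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right].$

Since this inequality is valid for every positive integer $r$ and $T_{G_i^c}$ is finite
$P_x^\delta$-a.s., it follows that

$E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} I[T_{G_i^c} < T]\right]$

$\le M\, E_x^\delta\left[I[T_{G_i^c} < T] \exp\left\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right]$

$\le M\, E_x^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T_{G_i^c}-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right].$

Since $M$ is finite, this relation and (5.11) yield that $E_x^\delta[\exp\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\} I[T_{G_i^c} < T]]$ is finite, and this fact, (5.4) and (5.13) together imply
that (5.12) occurs. To conclude, observe that, by the definition of function
$h(\cdot)$ in (4.1), the inclusion in (5.9) and (5.12) together yield that $h(x) < \infty$;
since $h(\cdot) > -\infty$, by Lemma 4.3, it follows that $h(x)$ is finite and, since
$x \in G_i$ is arbitrary, this shows that claim $(C_i)$ holds, completing the induction
proof. □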


6. A key inequality. This section contains the last technical tool that, 
together with the finiteness result in Theorem 5.1, will be used to establish 
Theorem 3.5. The main objective is to establish the following. 

Theorem 6.1. Let $z$ be the state in Assumption 2.1. In this case, the
deviation function in (4.1) satisfies that

$h(z) \le 0.$

The proof of this theorem relies on Lemma 6.3, whose conclusions involve 
the random times at which the system occupies the distinguished state z in 
Assumption 2.1. 
Definition 6.2. (i) The sequence $\{T_k\}$ of successive arrival times to
state $z$ is recursively determined as follows:

$T_1 := T$ and $T_k := \min\{n > T_{k-1} \mid X_n = z\}, \quad k > 1;$

see (2.4) for the definition of $T$.

(ii) Given $\varepsilon > 0$, define $\psi(\varepsilon)$ by

$e^{\lambda\psi(\varepsilon)} := \inf_{\delta \in \mathcal{P}^*} E_z^\delta\left[\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right],$

where, as before, $\gamma_0 = J^*(\lambda,z)$.

It is not difficult to see that each $T_k$ is a stopping time with respect to the
family of $\sigma$-fields $\{\sigma(I_n)\}$, that is, the event $[T_k = m]$ always lies in $\sigma(I_m)$,
and that

(6.1)  $T_k \ge k, \quad k = 1,2,\ldots.$


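Lemma 6.3. Let $\varepsilon \in (0,\bar{\varepsilon})$ be fixed, and suppose that $\pi \in \mathcal{P}$ is $\varepsilon$-optimal
at state $z$. In this case:

(i) For each positive integer $k$,

(6.2)  $E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{T_k-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] \ge e^{k\lambda\psi(\varepsilon)}.$

(ii) There exists $n_0$ such that, for every $k \ge n_0$,

$E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{T_k-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] \le \frac{e^{-\lambda\varepsilon k}}{1 - e^{-\lambda\varepsilon}}.$

Consequently:

(iii) $\psi(\varepsilon) \le -\varepsilon$.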


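Proof. (i) The argument is by induction. Since $\pi$ is $\varepsilon$-optimal at $z$ and
$\varepsilon \in (0,\bar{\varepsilon})$, Theorem 4.4(i) and Lemma 4.8(iii) together yield that there exists
$\delta \in \mathcal{P}^*$ such that $P_z^\pi = P_z^\delta$. In this case, using that $T_1 = T$, it follows that

$E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{T_1-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] = E_z^\delta\left[\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] \ge e^{\lambda\psi(\varepsilon)},$

where the inequality is due to Definition 6.2(ii), so that (6.2) is valid for $k = 1$.
Let $n$ be an integer larger than 1, and suppose that (6.2) holds when $k < n$.
In this situation, take a positive integer $r$ and $h_r = (x_0, a_0, \ldots, x_{r-1}, a_{r-1}, x_r) \in \mathbb{H}_r$
satisfying that

(6.3)  $P_z^\pi[T_{n-1} = r, I_r = h_r] > 0;$

since $X_{T_{n-1}} = z$ when $T_{n-1}$ is finite, it follows that $x_r = z$. Notice now that
$T_n > T_{n-1}$ so that, on the event $[T_{n-1} = r]$,

$\sum_{t=0}^{T_n-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon] = \sum_{t=0}^{r-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon] + \sum_{t=r}^{T_n-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon],$

and an application of the Markov property yields, via Definition 6.2, that

$E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{T_n-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\} \,\Big|\, T_{n-1} = r, I_r = h_r\right]$

$= \exp\left\{\lambda \sum_{t=0}^{r-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\} E_z^\delta\left[\exp\left\{\lambda \sum_{t=0}^{T_1-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right],$

where the shifted policy $\delta$ is as in (4.3). Since $\pi$ is $\varepsilon$-optimal at $z$, Lemma 4.5(i)
yields that $J(\lambda,\delta,z) \le J(\lambda,\pi,z) \le J^*(\lambda,z) + \varepsilon$, so that $\delta$ itself is $\varepsilon$-optimal
at $z$. Therefore, applying the case $k = 1$ of (6.2) to this policy $\delta$, it follows
that $E_z^\delta[\exp\{\lambda \sum_{t=0}^{T_1-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\}] \ge e^{\lambda\psi(\varepsilon)}$, which combined with
the above displayed equation leads to

$E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{T_n-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\} \,\Big|\, T_{n-1} = r, I_r = h_r\right] \ge \exp\left\{\lambda \sum_{t=0}^{r-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\} e^{\lambda\psi(\varepsilon)}.$

Since this inequality is valid whenever (6.3) holds, it follows that

$E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{T_n-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] \ge E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{T_{n-1}-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] e^{\lambda\psi(\varepsilon)} \ge e^{n\lambda\psi(\varepsilon)},$

where the induction hypothesis was used to set the second inequality. This
establishes the case $k = n$ of (6.2) and completes the induction argument.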
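(ii) Since $\pi$ is $\varepsilon$-optimal at $z$, there exists a positive integer $n_0$ such that
$J_n(\lambda,\pi,z) \le n(J^*(\lambda,z) + \varepsilon) = n(\gamma_0 + \varepsilon)$ when $n \ge n_0$. Observing that

$E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{n-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] = e^{-2n\lambda\varepsilon - n\lambda\gamma_0} E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{n-1} C(X_t,A_t)\right\}\right] = e^{-2n\lambda\varepsilon - n\lambda\gamma_0} e^{\lambda J_n(\lambda,\pi,z)},$

it follows that

(6.4)  $E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{n-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] \le e^{-n\lambda\varepsilon}, \quad n \ge n_0.$

Next, let the positive integer $k$ and $p \in (0,1)$ be fixed. In this case, (6.1) and
Hölder's inequality yield that

$E_z^\pi\left[\exp\left\{\lambda p \sum_{t=0}^{T_k-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] = \sum_{n=k}^{\infty} E_z^\pi\left[\exp\left\{\lambda p \sum_{t=0}^{n-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\} I[T_k = n]\right]$

$\le \sum_{n=k}^{\infty} \left(E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{n-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right]\right)^{p} \left(P_z^\pi[T_k = n]\right)^{1-p}$

$\le \sum_{n=k}^{\infty} \left(E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{n-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right]\right)^{p}.$

Combining this with (6.4) it follows that

$E_z^\pi\left[\exp\left\{\lambda p \sum_{t=0}^{T_k-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] \le \sum_{n=k}^{\infty} e^{-n\varepsilon p\lambda} = \frac{e^{-k\varepsilon\lambda p}}{1 - e^{-\varepsilon\lambda p}}, \quad k \ge n_0.$

Given a sequence $\{p_m\}$ of positive numbers increasing to 1, this inequality
implies, via Fatou's lemma, that for every positive integer $k \ge n_0$,

$E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{T_k-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] = E_z^\pi\left[\liminf_{m\to\infty} \exp\left\{\lambda p_m \sum_{t=0}^{T_k-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right]$

$\le \liminf_{m\to\infty} E_z^\pi\left[\exp\left\{\lambda p_m \sum_{t=0}^{T_k-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] \le \liminf_{m\to\infty} \frac{e^{-k\varepsilon\lambda p_m}}{1 - e^{-\varepsilon\lambda p_m}},$

and then

$E_z^\pi\left[\exp\left\{\lambda \sum_{t=0}^{T_k-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon]\right\}\right] \le \frac{e^{-k\varepsilon\lambda}}{1 - e^{-\varepsilon\lambda}}, \quad k \ge n_0.$

(iii) Observe that parts (i) and (ii) together yield that $e^{k\lambda\psi(\varepsilon)} \le e^{-k\varepsilon\lambda}/(1 - e^{-\varepsilon\lambda})$
when $k$ is large enough, and in this case

$\psi(\varepsilon) \le -\varepsilon - \frac{1}{k\lambda}\log(1 - e^{-\varepsilon\lambda}),$

so that the conclusion follows letting $k$ increase to $\infty$. □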
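The last step of part (iii) rests on the fact that the correction term $-\frac{1}{k\lambda}\log(1 - e^{-\varepsilon\lambda})$ is positive and vanishes as $k \to \infty$. A small numerical sketch of this behaviour is given below; the function name and the chosen values of $\varepsilon$ and $\lambda$ are hypothetical and serve only as an illustration.

```python
import math


def psi_upper_bound(eps, lam, k):
    """Right-hand side -eps - log(1 - exp(-eps * lam)) / (k * lam) of the bound
    on psi(eps) obtained from parts (i) and (ii) for a given k."""
    return -eps - math.log(1.0 - math.exp(-eps * lam)) / (k * lam)


# Hypothetical values eps = 0.1 and lam = 2: the bound decreases towards -eps.
for k in (1, 10, 100, 10**6):
    print(k, psi_upper_bound(0.1, 2.0, k))
```

Since $\log(1 - e^{-\varepsilon\lambda}) < 0$, every finite $k$ gives a bound strictly larger than $-\varepsilon$, and only the limit yields the inequality stated in the lemma.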


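Proof of Theorem 6.1. It will be shown that there exists a policy $\delta$
satisfying

(6.5)  $\delta \in \mathcal{P}^*$ and $E_z^\delta\left[\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0]\right\}\right] \le 1.$

Assuming that such a policy exists, Theorem 6.1 can be established as follows:
First, recall that $J^*(\lambda,\cdot)$ satisfies the min-max equation in (3.3), by
Lemma 3.1, and that $B^* = B_{J^*(\lambda,\cdot)}$, by Definition 4.1(i). Therefore, the inclusion
$\delta \in \mathcal{P}^*$ yields that $P_z^\delta[A_t \in B_{J^*(\lambda,\cdot)}(X_t)] = 1$ for every $t \in \mathbb{N}$, by Definition 4.1(ii),
so that an application of Lemma 3.4(i) implies that, for each
$n \in \mathbb{N}$,

$J^*(\lambda,X_n) \le J^*(\lambda,X_0) = J^*(\lambda,z) = \gamma_0, \quad P_z^\delta\text{-a.s.}$

Since $\gamma_0$ is the minimum value of $J^*(\lambda,\cdot)$, it follows that $P_z^\delta[J^*(\lambda,X_n) = \gamma_0] = 1$
for every $n \in \mathbb{N}$, so that the inequality in (6.5) is equivalent to

$E_z^\delta\left[\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] \le 1.$

From this point, an application of Hölder's inequality yields that

$E_z^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] \le \left(E_z^\delta\left[\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right]\right)^{\alpha} \le 1;$

recall that the fixed number $\alpha$ lies in $(0,1)$. Combining this inequality with
the inclusion $\delta \in \mathcal{P}^*$ and (4.1), it follows that

$h(z) \le \frac{1}{\lambda}\log E_z^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] \le 0,$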


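completing the proof of Theorem 6.1. To conclude, (6.5) will be established.
Let $\{\varepsilon_k\} \subset (0,\bar{\varepsilon})$ be a sequence converging to zero and notice that, for each
$k \in \mathbb{N}$, Lemma 6.3(iii) yields that $e^{\lambda\psi(\varepsilon_k)} \le e^{-\lambda\varepsilon_k} < e^{-\lambda\varepsilon_k/2}$. By Definition 6.2,
for every $k \in \mathbb{N}$ there exists a policy $\pi^k \in \mathcal{P}^*$ such that

(6.6)  $E_z^{\pi^k}\left[\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon_k]\right\}\right] < e^{-\lambda\varepsilon_k/2}.$

Let $\mathbb{P}(A)$ be the class of probability measures defined on the subsets of the
action space $A$. For each $r \in \mathbb{N}$ and $h_r \in \mathbb{H}_r$, $\{\pi_r^k(\cdot|h_r) \mid k \in \mathbb{N}\}$ is a sequence
in $\mathbb{P}(A)$ and, since $A$ is finite, there exists $\delta_r(\cdot|h_r) \in \mathbb{P}(A)$, as well as a
subsequence of $\{\pi^k\}$, denoted by $\{\pi^m\}$, such that

(6.7)  $\lim_{m\to\infty} \pi_r^m(F|h_r) =: \delta_r(F|h_r), \quad F \subset A.$

Moreover, since $\bigcup_{r=0}^{\infty} \mathbb{H}_r$ is denumerable, applying Cantor's diagonal method
it can be assumed that this convergence holds for every $F \subset A$, $r \in \mathbb{N}$ and
$h_r \in \mathbb{H}_r$, and it will be shown that $\delta := \{\delta_r\}$ satisfies (6.5). To achieve this
goal, first notice that $\pi_r^m(A(x_r)|h_r) = 1$ always holds, since $\pi^m \in \mathcal{P}^* \subset \mathcal{P}$, so
that (6.7) yields that $\delta_r(A(x_r)|h_r) = 1$ for every $r \in \mathbb{N}$ and $h_r \in \mathbb{H}_r$, that is,
$\delta$ is a policy. Next, observe that the equality

$P_x^\pi[I_r = h_r] = \delta_{x,x_0}\, \pi_0(a_0|x_0)\, p_{x_0 x_1}(a_0)\, \pi_1(a_1|x_0,a_0,x_1) \times \cdots \times \pi_{r-1}(a_{r-1}|x_0,a_0,\ldots,x_{r-1})\, p_{x_{r-1} x_r}(a_{r-1})$

is always valid, where $\delta_{x,y} := 1$ if $x = y$ and $\delta_{x,y} := 0$ otherwise. Combining
this equation with (6.7), it follows that for every $x \in S$, $r \in \mathbb{N}$ and $D : \mathbb{H}_r \to \mathbb{R}$,

(6.8)  $\lim_{m\to\infty} E_x^{\pi^m}[D(I_r)] = E_x^\delta[D(I_r)].$

In particular, for each $n \in \mathbb{N}$ and $x \in S$, $P_x^\delta[A_n \in B^*(X_n)] = \lim_{m\to\infty} P_x^{\pi^m}[A_n \in B^*(X_n)] = 1$,
where the inclusion $\pi^m \in \mathcal{P}^*$ was used to set the second equality, so that

(6.9)  $\delta \in \mathcal{P}^*;$

see Definition 4.1. Moreover, (6.8) yields that for every $r \in \mathbb{N}$,

(6.10)  $\lim_{m\to\infty} E_z^{\pi^m}\left[\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0]\right\} I[T \le r]\right] = E_z^\delta\left[\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0]\right\} I[T \le r]\right].$

Observing that

$\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon_m]\right\} I[T \le r] \ge e^{-2\lambda r\varepsilon_m} \exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0]\right\} I[T \le r],$

it follows that

$E_z^{\pi^m}\left[\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0]\right\} I[T \le r]\right] \le e^{2\lambda r\varepsilon_m} E_z^{\pi^m}\left[\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0 - 2\varepsilon_m]\right\} I[T \le r]\right],$

so that $E_z^{\pi^m}[\exp\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0]\} I[T \le r]] \le e^{2\lambda r\varepsilon_m} e^{-\lambda\varepsilon_m/2}$ [see (6.6)];
since $\{\varepsilon_m\}$ converges to zero, this inequality and (6.10) together yield that
$E_z^\delta[\exp\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0]\} I[T \le r]] \le 1$ for every $r \in \mathbb{N}$ and, via the
monotone convergence theorem, this implies that

$E_z^\delta\left[\exp\left\{\lambda \sum_{t=0}^{T-1} [C(X_t,A_t) - \gamma_0]\right\}\right] \le 1.$

Combining this inequality with the inclusion in (6.9), it follows that the
conditions in (6.5) are satisfied by policy $\delta$. □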


7. Proof of the main result. After the previous preliminaries, in this
section the characterization result in Theorem 3.5 will be finally proved.
The argument combines Theorems 5.1 and 6.1 with the properties of the
policies in $\mathcal{P}^*$ established in the following lemma.


Lemma 7.1. Given a policy $\pi \in \mathcal{P}^*$, suppose that for some $(x,a) \in \mathbb{K}$ the
inequality $P_x^\pi[A_0 = a] > 0$ holds and define the shifted policy $\delta$ by

(7.1)  $\delta_t(\cdot|h_t) = \pi_{t+1}(\cdot|x,a,h_t).$

In this case, for each $y \in S$ satisfying that $p_{xy}(a) > 0$, assertions (i) and (ii)
below hold.

(i) $P_y^\delta[A_t \in B^*(X_t)] = 1$ for every $t \in \mathbb{N}$.

(ii) There exists $\tilde{\delta} \in \mathcal{P}^*$ such that $P_y^\delta = P_y^{\tilde{\delta}}$.

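Proof. (i) Suppose that $p_{xy}(a) > 0$ and observe that $P_x^\pi[X_1 = y, A_0 = a] = P_x^\pi[X_1 = y \mid A_0 = a]\, P_x^\pi[A_0 = a] = p_{xy}(a)\, P_x^\pi[A_0 = a] > 0$. Thus, for every
$t \in \mathbb{N}$,

$P_x^\pi[A_{t+1} \notin B^*(X_{t+1})]$

$\ge P_x^\pi[A_{t+1} \notin B^*(X_{t+1}), X_1 = y, A_0 = a]$

$= P_x^\pi[A_{t+1} \notin B^*(X_{t+1}) \mid X_1 = y, A_0 = a]\, P_x^\pi[X_1 = y, A_0 = a].$

Observing that $P_x^\pi[A_{t+1} \notin B^*(X_{t+1}) \mid X_1 = y, A_0 = a] = P_y^\delta[A_t \notin B^*(X_t)]$, which
is due to the definition of policy $\delta$ and the Markov property, it follows that

$P_x^\pi[A_{t+1} \notin B^*(X_{t+1})] \ge P_y^\delta[A_t \notin B^*(X_t)]\, P_x^\pi[X_1 = y, A_0 = a].$

Since $P_x^\pi[A_{t+1} \notin B^*(X_{t+1})] = 0$, by the inclusion $\pi \in \mathcal{P}^*$, and $P_x^\pi[X_1 = y, A_0 = a] > 0$,
it follows that $P_y^\delta[A_t \notin B^*(X_t)] = 0$, that is, $P_y^\delta[A_t \in B^*(X_t)] = 1$.

(ii) Pick a stationary policy $f$ such that $f(y) \in B^*(y)$ for each $y \in S$, and
define the policy $\tilde{\delta}$ as follows: For each $t \in \mathbb{N}$ and $h_t \in \mathbb{H}_t$,

$\tilde{\delta}_t(\cdot|h_t) = \delta_t(\cdot|h_t)$ if $p_{x x_0}(a) > 0$,

$\tilde{\delta}_t(\{f(x_t)\}|h_t) = 1$ if $p_{x x_0}(a) = 0$.

From this definition it follows that $P_w^{\tilde{\delta}} = P_w^\delta$ when $p_{xw}(a) > 0$, and $P_w^{\tilde{\delta}} = P_w^f$
if $p_{xw}(a) = 0$. Therefore, $P_y^{\tilde{\delta}} = P_y^\delta$, since $p_{xy}(a) > 0$, whereas the choice
of $f$ and part (i) together imply that $P_w^{\tilde{\delta}}[A_t \in B^*(X_t)] = 1$ always holds, that
is, $\tilde{\delta} \in \mathcal{P}^*$, by Definition 4.1. □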

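Proof of Theorem 3.5. Recall that the fixed number $\alpha$ belongs to
$(0,1)$ and let $g(\cdot)$ be the function defined in (3.6). It will be shown that this
function belongs to the family $\mathcal{G}$ in Definition 3.2. Using that $\alpha$ is positive,
from Lemma 3.1 it is not difficult to see that the min-max equation (3.3) holds for $g(\cdot)$,
so that $g(\cdot)$ satisfies the first requirement in Definition 3.2. Moreover, for
each $(x,a) \in \mathbb{K}$, the equality $g(x) = \max\{g(y) \mid p_{xy}(a) > 0\}$ is equivalent to
$J^*(\lambda,x) = \max\{J^*(\lambda,y) \mid p_{xy}(a) > 0\}$, so that

$B_g(x) = B_{J^*(\lambda,\cdot)}(x) = B^*(x), \quad x \in S;$

see (3.5) and Definition 4.1(i). It will be verified that the second part of Definition 3.2
is satisfied by the pair $(g(\cdot), h(\cdot))$, where $h(\cdot)$ is given in (4.1). To
achieve this goal, first notice that this function $h(\cdot)$ is finite, by Theorem 5.1.
Next, select a policy $\pi \in \mathcal{P}^*$ and let $x \in S$ be arbitrary. For each action $a$ satisfying
that $P_x^\pi[A_0 = a] > 0$, it follows that $a \in B^*(x)$, since $\pi \in \mathcal{P}^*$, whereas
the Markov property yields

$E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} \,\Big|\, A_0 = a\right]$

$= E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} I[T = 1] \,\Big|\, A_0 = a\right] + E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} I[T > 1] \,\Big|\, A_0 = a\right]$

$= e^{\lambda\alpha[C(x,a) - J^*(\lambda,x)]} p_{xz}(a) + e^{\lambda\alpha[C(x,a) - J^*(\lambda,x)]} \sum_{y \ne z} p_{xy}(a)\, E_y^\delta\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right],$

where $\delta$ is the shifted policy in (7.1). By Lemma 7.1, there exists $\tilde{\delta} \in \mathcal{P}^*$
such that $P_y^\delta = P_y^{\tilde{\delta}}$ when $p_{xy}(a) > 0$, so that

$E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} \,\Big|\, A_0 = a\right]$

$= e^{\lambda\alpha[C(x,a) - J^*(\lambda,x)]} p_{xz}(a) + e^{\lambda\alpha[C(x,a) - J^*(\lambda,x)]} \sum_{y \ne z} p_{xy}(a)\, E_y^{\tilde{\delta}}\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right]$

$\ge e^{\lambda\alpha[C(x,a) - J^*(\lambda,x)]} p_{xz}(a) + e^{\lambda\alpha[C(x,a) - J^*(\lambda,x)]} \sum_{y \ne z} p_{xy}(a)\, e^{\lambda h(y)},$

where the inequality is due to the inclusion $\tilde{\delta} \in \mathcal{P}^*$; see (4.1). Recalling that
$a \in B^*(x)$, this leads to

$E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\} \,\Big|\, A_0 = a\right] \ge \min_{b \in B^*(x)} \left[e^{\lambda\alpha[C(x,b) - J^*(\lambda,x)]} p_{xz}(b) + e^{\lambda\alpha[C(x,b) - J^*(\lambda,x)]} \sum_{y \ne z} p_{xy}(b)\, e^{\lambda h(y)}\right],$

and then, since this inequality holds for every action $a$ satisfying that $P_x^\pi[A_0 = a] > 0$,

$E_x^\pi\left[\exp\left\{\lambda\alpha \sum_{t=0}^{T-1} [C(X_t,A_t) - J^*(\lambda,X_t)]\right\}\right] \ge \min_{b \in B^*(x)} \left[e^{\lambda\alpha[C(x,b) - J^*(\lambda,x)]} p_{xz}(b) + e^{\lambda\alpha[C(x,b) - J^*(\lambda,x)]} \sum_{y \ne z} p_{xy}(b)\, e^{\lambda h(y)}\right].$

Using that $\pi \in \mathcal{P}^*$ and $x \in S$ are arbitrary, via (4.1), this inequality yields

(7.2)  $e^{\lambda h(x)} \ge \min_{b \in B^*(x)} \left[e^{\lambda\alpha[C(x,b) - J^*(\lambda,x)]} p_{xz}(b) + e^{\lambda\alpha[C(x,b) - J^*(\lambda,x)]} \sum_{y \ne z} p_{xy}(b)\, e^{\lambda h(y)}\right], \quad x \in S.$

On the other hand, by Theorem 6.1,

$e^{\lambda h(z)} \le 1,$

which, combined with (7.2), implies that for every $x \in S$,

$e^{\lambda h(x)} \ge \min_{b \in B^*(x)} \left[e^{\lambda\alpha[C(x,b) - J^*(\lambda,x)]} \sum_{y} p_{xy}(b)\, e^{\lambda h(y)}\right],$

and then, multiplying both sides of this inequality by $e^{\lambda[\alpha J^*(\lambda,x) + (1-\alpha)\|C\|]} = e^{\lambda g(x)}$,

$e^{\lambda g(x) + \lambda h(x)} \ge \min_{b \in B^*(x)} \left[e^{\lambda[\alpha C(x,b) + (1-\alpha)\|C\|]} \sum_{y} p_{xy}(b)\, e^{\lambda h(y)}\right];$

since $\alpha C(x,b) + (1-\alpha)\|C\| \ge C(x,b)$, this yields that

$e^{\lambda g(x) + \lambda h(x)} \ge \min_{b \in B^*(x)} \left[e^{\lambda C(x,b)} \sum_{y} p_{xy}(b)\, e^{\lambda h(y)}\right], \quad x \in S.$

Therefore, the pair $(g(\cdot), h(\cdot))$ satisfies the second condition of Definition 3.2,
and it follows that

$\alpha J^*(\lambda,\cdot) + (1-\alpha)\|C\| \in \mathcal{G}.$

This inclusion is valid for each $\alpha \in (0,1)$, so that

$J^*(\lambda,x) \ge \inf_{g \in \mathcal{G}} g(x), \quad x \in S,$

and, via Lemma 3.4, this implies that $J^*(\lambda,x) = \inf_{g \in \mathcal{G}} g(x)$ for every state
$x$, completing the proof of Theorem 3.5. □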
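The final step of the proof uses the one-parameter family $\alpha J^*(\lambda,\cdot) + (1-\alpha)\|C\|$, $\alpha \in (0,1)$, whose members approach $J^*(\lambda,\cdot)$ from above as $\alpha$ increases to 1. The sketch below illustrates this convergence at a single state; it assumes that (3.6) defines $g$ by exactly this convex combination, as the last display of the proof indicates, and the numerical values are hypothetical.

```python
def g_alpha(j_star_x, c_norm, alpha):
    """Value at one state of alpha * J*(lam, x) + (1 - alpha) * ||C||."""
    return alpha * j_star_x + (1.0 - alpha) * c_norm


# Hypothetical data: J*(lam, x) = 2 and ||C|| = 5, so J* <= ||C|| as required.
j_star_x, c_norm = 2.0, 5.0
for alpha in (0.5, 0.9, 0.99, 0.999):
    print(alpha, g_alpha(j_star_x, c_norm, alpha))
```

Because each member of the family lies in the class over which the infimum is taken, the infimum cannot exceed any of these values, which is how the inequality $J^*(\lambda,x) \ge \inf g(x)$ arises.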
Remark 7.2. As a consequence of the results presented in this paper,
two main problems remain open:

(i) Find (nontrivial) conditions under which the optimal value function
$J^*(\lambda,\cdot)$ belongs to the set $\mathcal{G}$ and there exists a solution to the dynamic programming
equation.

(ii) Find an efficient algorithm to approximate the optimal value function
and obtain $\varepsilon$-optimal stationary policies.

Acknowledgment. The authors are deeply grateful to the referee and the 